Implementing Communication-Avoiding Algorithms


Jim Demmel, EECS & Math Departments, UC Berkeley

Why avoid communication?

• Communication = moving data
  – Between levels of the memory hierarchy
  – Between processors over a network
• Running time of an algorithm is the sum of 3 terms:
  – #flops · time_per_flop
  – #words moved / bandwidth
  – #messages · latency
• time_per_flop << 1/bandwidth << latency, and the gaps are growing exponentially with time [FOSC]
• Avoid communication to save time
• Same story for energy: avoid communication to save energy
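
To make the three-term model concrete, here is a minimal sketch in Python; the machine parameters below are made-up illustrative values, not measurements from any real machine:

    # Three-term running-time model from the slide above.
    gamma = 1e-11   # time_per_flop (seconds per flop)
    beta  = 1e-9    # seconds per word moved (1/bandwidth)
    alpha = 1e-6    # seconds per message (latency)

    def running_time(flops, words_moved, messages):
        # T = #flops * time_per_flop + #words / bandwidth + #messages * latency
        return flops * gamma + words_moved * beta + messages * alpha

    # Even a modest message count can dominate unless communication is avoided:
    print(running_time(flops=1e9, words_moved=1e7, messages=1e4))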

Goals

• Redesign algorithms to avoid communication
  – Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
  – Current algorithms often far from lower bounds
  – Large speedups and energy savings possible

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Lower bound for all "n³-like" linear algebra

• Let M = "fast" memory size (per processor). Then:

    #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

  and, since #messages_sent ≥ #words_moved / largest_message_size,

    #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced
• Holds for:
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

SIAM SIAG/Linear Algebra Prize 2012: Ballard, D., Holtz, Schwartz

Limits to parallel scaling (1/2)

• Consider dense case, flops_per_proc = n³/P
  – Words = Ω( n³ / (P·M^(1/2)) )
  – Messages = Ω( n³ / (P·M^(3/2)) )
• What is M? Must be at least n²/P to hold data
  – Words = Ω( n² / P^(1/2) )
  – Messages = Ω( P^(1/2) )
• But if M fixed, looks like perfect strong scaling in time
  – Flops, Words, Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second, for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P (and M)?
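
As a worked step (in LaTeX), substituting the minimal memory M = n²/P into the per-processor bounds gives the two numbers above:

    W = \Omega\!\left(\frac{n^3}{P\,M^{1/2}}\right)
      = \Omega\!\left(\frac{n^3}{P\,(n^2/P)^{1/2}}\right)
      = \Omega\!\left(\frac{n^2}{P^{1/2}}\right),
    \qquad
    S = \Omega\!\left(\frac{n^3}{P\,M^{3/2}}\right)
      = \Omega\!\left(\frac{n^3}{P\,(n^2/P)^{3/2}}\right)
      = \Omega\!\left(P^{1/2}\right).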

Limits to parallel scaling (2/2)

• Consider dense case, flops_per_proc = n³/P
  – Words = Ω( n³ / (P·M^(1/2)) )
  – Messages = Ω( n³ / (P·M^(3/2)) )
• How big can we make P and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise, no communication may be needed
• Thm: Words = Ω( n² / P^(2/3) ), independent of M
  – Reached when M = n²/P^(2/3) too, or P = n³/M^(3/2), and Messages = Ω(1) (log P in practice)
  – Attained by 2.5D algorithm when c = P^(1/3) ("3D alg")
  – Can keep increasing P until P = n³; then Words = Messages = Ω(1) (log n in practice)

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too)
• Only a few sparse algorithms so far
• Lots of work in progress
  – Algorithms, Energy, Heterogeneous Processors, …


2.5D Matrix Multiplication

• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid of dimensions (P/c)^(1/2) x (P/c)^(1/2) x c; example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, with axes i, j, k

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
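
A serial simulation of steps (2)-(3), as a sketch only: the real 2.5D algorithm distributes blocks over the process grid, while here each "layer" is just a loop iteration handling its 1/c-th of the k-summation:

    import numpy as np

    def matmul_25d_layers(A, B, c=2):
        # Each of the c layers computes its share of the middle summation;
        # summing the partials plays the role of the k-axis sum-reduce.
        n = A.shape[0]
        partial = []
        for k in range(c):
            lo, hi = k * n // c, (k + 1) * n // c
            partial.append(A[:, lo:hi] @ B[lo:hi, :])
        return sum(partial)   # the sum-reduce of step (3)

    n = 8
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_25d_layers(A, B, c=2), A @ B)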

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Strong-scaling plot, annotations: 12x faster; 2.7x faster]

Distinguished Paper Award, EuroPar'11 (Solomonik, D.)
SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n³/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec, for leakage, etc.
  – E(cP) = cP · { n³/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n³/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n³/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as:
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given a target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
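
A quick numerical check of the two identities above; all constants are hypothetical, and the per-processor memory M is held fixed as P grows by the factor c:

    # Sketch: verify T(cP) = T(P)/c and E(cP) = E(P) with M fixed.
    gT, bT, aT = 1e-11, 1e-9, 1e-6      # gamma_T, beta_T, alpha_T (made up)
    gE, bE, aE = 1e-10, 1e-9, 1e-5      # gamma_E, beta_E, alpha_E (made up)
    dE, eE = 1e-12, 1.0                 # delta_E, epsilon_E (made up)
    n, P, c, m = 4096, 64, 4, 1000
    M = 3 * n**2 // P                   # minimal memory: P*M = 3n^2

    def T(p):
        return n**3 / p * (gT + bT / M**0.5 + aT / (m * M**0.5))

    def E(p):
        per_proc = n**3 / p * (gE + bE / M**0.5 + aE / (m * M**0.5))
        return p * (per_proc + dE * M * T(p) + eE * T(p))

    assert abs(T(c * P) - T(P) / c) < 1e-9 * T(P)   # perfect time scaling
    assert abs(E(c * P) - E(P)) < 1e-9 * E(P)       # constant energy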

Handling Heterogeneity

• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[ γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2) ] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n³, minimizing T = max_i T_i
  – Answer: F_i = n³·(1/ξ_i)/Σ_j(1/ξ_j), and T = n³/Σ_j(1/ξ_j)
• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms, … (see the sketch below)
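
A sketch of the optimal work split, with made-up per-processor parameters:

    import numpy as np

    # Hypothetical per-processor parameters gamma_i, beta_i, alpha_i, M_i
    gamma = np.array([1e-11, 2e-11, 5e-11])
    beta  = np.array([1e-9,  1e-9,  2e-9])
    alpha = np.array([1e-6,  2e-6,  1e-6])
    Mem   = np.array([1e8,   5e7,   1e8])

    xi = gamma + beta / Mem**0.5 + alpha / Mem**1.5   # effective sec/flop
    n = 4096
    F = n**3 * (1 / xi) / np.sum(1 / xi)              # flops for processor i
    T = n**3 / np.sum(1 / xi)                         # common finish time
    assert np.allclose(F * xi, T)                     # all finish together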

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
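
The example contraction, written with numpy's einsum; a serial sketch only (CTF's distributed API differs):

    import numpy as np

    ni, nj, nk, nm, nn = 4, 4, 4, 3, 3
    A = np.random.rand(ni, nj, nm, nn)
    B = np.random.rand(nm, nn, nk)

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
    C = np.einsum('ijmn,mnk->ijk', A, B)

    # Same contraction as a matmul: flatten (m,n) into one index,
    # which is why the matmul lower bounds apply.
    C2 = (A.reshape(ni * nj, nm * nn) @ B.reshape(nm * nn, nk)).reshape(ni, nj, nk)
    assert np.allclose(C, C2)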

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – #words_moved = Ω( #flops / M^(log_mp(q) − 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also appeared in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:
  #words_moved = Ω( M·(n/M^(1/2))³ / P )

Strassen's O(n^(lg 7)) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )

Strassen-like O(n^ω) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication-Avoiding Parallel Strassen (CAPS)

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

    CAPS:
      if EnoughMemory and P ≥ 7
        then BFS step
        else DFS step
      end if

• The best way to interleave BFS and DFS steps is a tuning parameter

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as a Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms,
  #words_moved = O( n^(ω+η) / M^((ω+η)/2 − 1) + n²·log n ),
  i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors and available memory
• CARMA:
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors and available memory

CARMA Performance: Distributed Memory

[Plot (log-log axes): square case, m = k = n = 6144; CARMA vs ScaLAPACK vs peak]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Distributed Memory

[Plot (log-log axes): inner-product case, m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK vs peak]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Shared Memory

[Plot (log x-axis, linear y-axis): square case, m = k = n; MKL and CARMA, single and double precision, vs single/double peak]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Plot (log x-axis, linear y-axis): inner-product case, m = n = 64; MKL and CARMA, single and double precision]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Plot (linear axes): shared-memory inner product (m = n = 64, k = 524,288); 97% fewer misses; 86% fewer misses]


One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n³)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n³/M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
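
For concreteness, a runnable sketch of the recursive approach; assumptions: no pivoting, in-place on numpy views, and a diagonally dominant test matrix so the pivots are safe:

    import numpy as np

    def factor(A):
        # Recursive LU without pivoting, in place: L strictly below the
        # diagonal (unit diagonal implied), U on and above it.
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]                 # update the single column
            return
        h = n // 2
        factor(A[:, :h])                        # factor left half
        L11 = np.tril(A[:h, :h], -1) + np.eye(h)
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])   # U12 = L11^{-1} A12
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]            # trailing update
        factor(A[h:, h:])                       # factor right half

    n = 8
    A = np.random.rand(n, n) + n * np.eye(n)    # diagonally dominant
    LU = A.copy(); factor(LU)
    L = np.tril(LU, -1) + np.eye(n); U = np.triu(LU)
    assert np.allclose(L @ U, A)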

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3]  (tall-skinny, blocked by rows)

Parallel (binary tree):
  W0→R00, W1→R10, W2→R20, W3→R30;  [R00;R10]→R01, [R20;R30]→R11;  [R01;R11]→R02

Sequential/Streaming (flat tree):
  W0→R00;  [R00;W1]→R01;  [R01;W2]→R02;  [R02;W3]→R03

Dual Core (hybrid tree):
  each core reduces its own row blocks with a flat tree (R00, R01, …), and the partial R's (R01, R11) are combined pairwise until one R03 remains

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
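
A sketch of the parallel (binary-tree) variant, returning only the R factor; the block count is illustrative, and the tree R agrees with ordinary QR only up to the signs of its rows:

    import numpy as np

    def tsqr_r(W, nblocks=4):
        # Leaf QRs on row blocks, then pairwise QR of stacked R's up the tree.
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(1000, 8)            # tall-skinny
    R_tree = tsqr_r(W)
    R_ref = np.linalg.qr(W, mode='r')
    assert np.allclose(np.abs(R_tree), np.abs(R_ref))   # equal up to row signs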

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

  Round 1: factor each block, Wi = Pi·Li·Ui, and choose b pivot rows of each Wi; call them Wi'
  Round 2: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, W12' and W34'
  Round 3: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, do LU without pivoting)
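
A sketch of one tournament over 4 row blocks, using SciPy's partial-pivoting LU at each tree node to nominate pivot rows; the function names and block count are illustrative only:

    import numpy as np
    from scipy.linalg import lu

    def pivot_rows(Wsub, b):
        # Rows that GEPP would move to the top: Wsub = P @ L @ U,
        # so row argmax(P[:, i]) of Wsub becomes row i of L @ U.
        P, L, U = lu(Wsub)
        return [int(np.argmax(P[:, i])) for i in range(b)]

    def tournament(W, b, nblocks=4):
        groups = np.array_split(np.arange(W.shape[0]), nblocks)
        winners = [g[pivot_rows(W[g], b)] for g in groups]   # round 1
        while len(winners) > 1:                              # reduction tree
            nxt = []
            for i in range(0, len(winners), 2):
                cand = np.concatenate(winners[i:i + 2])
                nxt.append(cand[pivot_rows(W[cand], b)])
            winners = nxt
        return winners[0]   # b global pivot rows: move to top, LU w/o pivoting

    W = np.random.rand(64, 4)
    print(tournament(W, b=4))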

Minimizing Communication in TSLU

W = [W1; W2; W3; W4]

  Parallel: LU on each Wi, then pairwise LUs up a binary tree
  Sequential/Streaming: LU of W1, then fold in W2, W3, W4 one at a time (flat tree)
  Dual Core: hybrid tree, as for TSQR

Can choose the reduction tree dynamically to match the architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": The new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare the L factors: are they the same?
  – Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
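
The same experiment is easy to reproduce in Python (a sketch; SciPy's lu returns A = P·L·U, so P.T @ A plays the role of Matlab's P*A):

    import numpy as np
    import scipy.linalg as sla

    def lu_nopivot(A):
        # Textbook LU without pivoting; zero or tiny pivots (likely for a
        # rank-deficient A) produce the Infs/NaNs the slide describes.
        A = A.copy(); n = len(A)
        for k in range(n - 1):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return np.tril(A, -1) + np.eye(n)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 3))
    A = X @ rng.standard_normal((3, 6))        # random 6x6, rank 3
    P, L, U = sla.lu(A)
    Lnp = lu_nopivot(P.T @ A)
    print(np.max(np.abs(L - Lnp)))             # 0, O(1), Inf, or NaN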

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting

[Figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure]

Exascale Machine Parameters
(Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); up to 29x predicted speedup]

2.5D vs 2D LU, With and Without Pivoting

[Plot]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, with D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

[Figure: band structure of T]

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

  Recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M)

  Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• The need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting


What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm (checked in the sketch below):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
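
A serial check of Kleene's algorithm against Floyd-Warshall, with ⊗ as min-plus including the min against the existing D (a sketch; assumes n is a power of 2 and a complete graph with finite weights):

    import numpy as np

    def otimes(D, A, B):
        # D = A (x) B: D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = len(A)
        if n == 1:
            return A.copy()
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = otimes(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = otimes(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = otimes(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = otimes(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = otimes(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = otimes(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    def floyd_warshall(A):
        D = A.copy()
        for k in range(len(D)):
            D = np.minimum(D, D[:, [k]] + D[[k], :])
        return D

    n = 8
    A = np.random.rand(n, n); np.fill_diagonal(A, 0.0)
    assert np.allclose(dc_apsp(A), floyd_warshall(A))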

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost


Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar):
  – A → Qᵀ·A·Q = T, where Q orthogonal, T tridiagonal
  – T → Uᵀ·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Qᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Animation over several slides: bulge-chasing sweeps 1-6 proceed down the band; orthogonal transforms Q1, …, Q5 and their transposes are applied to (b+1)-wide blocks, each creating a bulge of width d+c that is chased off the band]

Conventional vs CA-SBR

  Conventional: touch all data 4 times
  Communication-avoiding: touch all data once

[Figures: memory access patterns of the two approaches]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Qᵀ·A·Q will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

Two levels of memory (words; messages), and full memory hierarchy (words; messages):

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] in all four cases
• Cholesky: words [G'97][AP'00][LAPACK][BDHS'09]; messages and memory hierarchy [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] (words and messages)
• LU: words [G'97][T'97][GDX'11][BDLST'13]; messages [GDX'11][BDLST'13]; hierarchy words [G'97][T'97][BDLST'13]; hierarchy messages [BDLST'13]
• QR: words [EG'98][FW'03][DGHL'12][BDLST'13]; messages [FW'03][DGHL'12][BDLST'13]; hierarchy words [EG'98][FW'03][BDLST'13]; hierarchy messages [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: words [BDD'11][BDK'13]; messages [BDD'11]
• Non Sym Eig: [BDD'11] (words and messages)

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: words [AGZ'94][MT'99][ScaLAPACK]; messages [C'69][vGW'97][SD'11]; latency saving factor n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; latency saving factor n/P^(1/2)
• Sym Indefinite: words [BBDDDPSTY'13][ScaLAPACK]; messages [BBDDDPSTY'13]; latency saving factor n/P^(1/2)
• LU: words [ScaLAPACK][GDX'11][T'99][SD'11]; messages [GDX'11][T'99][SD'11]; latency saving factor n/P^(1/2)
• QR: words [ScaLAPACK][DGHL'12][T'99]; messages [DGHL'12][T'99]; latency saving factor n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: words [BDD'11][BDK'13][ScaLAPACK]; messages [BDD'11][BDK'13]; latency saving factor n/P^(1/2)
• Non-Sym Eig: [BDD'11] (words and messages); saving factors: bandwidth P^(1/2), latency n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)


Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k·log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability


Example: The Difficulty of Tuning SpMV

• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Spy plots of the matrix and of a zoomed region showing the 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search

[Register-profile plot (Mflops) comparing the reference implementation to the best register blocking (4x2)]

Register Profile: Itanium 2

[Heat map over register block sizes: 190 Mflops (worst) to 1190 Mflops (best)]

Register Profiles: IBM and Intel IA-64

[Four register-blocking heat maps (worst-best Mflops, best as % of machine peak): Power3 17% (122-252 Mflops), Power4 16% (459-820 Mflops), Itanium 1 8% (107-247 Mflops), Itanium 2 33% (190 Mflops-1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

[Spy plot of the matrix, and a zoom-in to the top corner]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But this would lead to lots of "fill-in"

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5² = 2.25x higher
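
The same trade can be tried with SciPy's block-sparse (BSR) format; a sketch on a synthetic matrix, not the Ex11 matrix itself:

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(90, 90, density=0.05, format='csr', random_state=0)
    Ab = A.tobsr(blocksize=(3, 3))       # pads partial blocks with explicit zeros

    true_nnz = A.nnz
    stored = Ab.data.shape[0] * 3 * 3    # stored entries incl. explicit zeros
    print('fill ratio =', stored / true_nnz)

    x = np.ones(90)
    assert np.allclose(A @ x, Ab @ x)    # same SpMV result, different kernel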

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Spy plot]

100x100 Submatrix Along Diagonal

[Spy plot]

Post-RCM Reordering

[Spy plot]

Effect of Combined RCM+TSP Reordering

[Spy plot: before = green + red; after = green + blue]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)

[Algorithm figure: in each iteration, the SpMV and the dot products require communication]
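
For reference, a textbook CG sketch with the per-iteration communication points marked in comments; this is the classical method that CA-CG reorganizes (A, b, and the iteration count are placeholders):

    import numpy as np

    def cg(A, b, iters=100):
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rr = r @ r                      # dot product -> global reduction
        for _ in range(iters):
            Ap = A @ p                  # SpMV -> neighbor communication
            alpha = rr / (p @ Ap)       # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r              # dot product -> global reduction
            if rr_new < 1e-30:          # converged: avoid dividing by ~0
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 100                             # 1D Poisson test problem
    A = np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))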

Example: CA-Conjugate Gradient

[Algorithm figure: the s SpMVs per outer iteration are computed via the CA matrix powers kernel; one global reduction computes the Gram matrix G; the local computations within the inner loop require no communication]
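
The kernel at the heart of this reorganization, in its naive monomial-basis form (a sketch; the CA version computes the same basis with O(1) rounds of communication, and the conditioning problems of this basis are discussed next):

    import numpy as np

    def matrix_powers(A, v, s):
        # Returns [v, A v, A^2 v, ..., A^s v] as columns.
        V = [v]
        for _ in range(s):
            V.append(A @ V[-1])
        return np.column_stack(V)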


[Convergence plot, model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; CA-CG (monomial basis) vs CG, residuals down to machine precision. Roundoff causes slower convergence and loss of accuracy; at s = 16 the monomial basis is rank deficient and the method breaks down]


What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz))     CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz))     Graph Laplacian             Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·Vᵀ)^k as a sum of S^k and low-rank matrices


Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign is not reproducible]

• Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get the exact answer (fails 2)
  – Prerounding technique (Nguyen, D.)
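
A toy sketch of the prerounding idea, not the actual ReproBLAS algorithm (assumes no overflow and at most ~2^20 summands): rounding every summand to a multiple of a common ulp makes each level's additions exact, hence order-independent:

    import numpy as np

    def prerounded_sum(x, levels=2):
        # (r + C) - C rounds r to a multiple of ulp(C), so summing the
        # resulting "high" parts incurs no rounding, in any order.
        x = np.asarray(x, dtype=np.float64)
        total, r = 0.0, x
        for _ in range(levels):
            m = np.max(np.abs(r))
            if m == 0.0:
                break
            C = 1.5 * 2.0 ** (np.ceil(np.log2(m)) + 21)   # extraction constant
            high = (r + C) - C          # exact multiples of ulp(C)
            total += np.sum(high)       # exact, order-independent
            r = r - high                # remainder, refined at the next level
        return total

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**5)
    s1 = prerounded_sum(x)
    s2 = prerounded_sum(x[::-1])        # different summation order
    assert s1 == s2                     # bit-wise identical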

Performance results on 1024 procs of a Cray XC30:
1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra and n-body algorithms and software (and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary

    Why avoid communication

    bull Communication = moving datandash Between level of memory hierarchyndash Between processors over a network

    bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency

    2

    communication

    bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]

    bull Avoid communication to save timebull Same story for energy

    bull Avoid communication to save energy

    Goals

    3

    bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

    bull L1 L2 DRAM network etc bull Attain lower bounds if possible

    bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

    Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

    ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

    ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

    bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

    bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

    Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

    ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

    ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

    bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

    bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

    Lower bound for all ldquon3-likerdquo linear algebra

    bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

    matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

    6

    bull Let M = ldquofastrdquo memory size (per processor)

    words_moved (per processor) = (flops (per processor) M12 )

    messages_sent (per processor) = (flops (per processor) M32 )

    bull Parallel case assume either load or memory balanced

    Lower bound for all ldquon3-likerdquo linear algebra

    bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

    matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

    7

    bull Let M = ldquofastrdquo memory size (per processor)

    words_moved (per processor) = (flops (per processor) M12 )

    messages_sent ge words_moved largest_message_size

    bull Parallel case assume either load or memory balanced

    Lower bound for all ldquon3-likerdquo linear algebra

    bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

    matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

    8

    bull Let M = ldquofastrdquo memory size (per processor)

    words_moved (per processor) = (flops (per processor) M12 )

    messages_sent (per processor) = (flops (per processor) M32 )

    bull Parallel case assume either load or memory balanced

    SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

    Limits to parallel scaling (12)bull Consider dense case flops_per_proc = n3P

    ndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

    bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

    bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

    bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

    bull How big can we make P and M

    Limits to parallel scaling (22)

    bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

    bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

    ndash Otherwise no communication may be neededbull Thm Words= (n2P23 ) independent of M

    bull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

    Can we attain these lower bounds

    bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

    bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

    new ways to encode answers new data structures

    ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

    ndash Algorithms Energy Heterogeneous Processors hellip11

    Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

    ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

    ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

    bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

    bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

    25D Matrix Multiplication

    bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

    c

    (Pc)12

    (Pc)12

    Example P = 32 c = 2

    25D Matrix Multiplication

    bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

    k

    j

    iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

    (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

    (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

    25D Matmul on BGP 16K nodes 64K coresc = 16 copies

    Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

    12x faster

    27x faster

    Perfect Strong Scaling ndash in Time and Energy (12)

    bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

    bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

    ndash γT βT αT = secs per flop per word_moved per message of size m

    bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

    ndash γE βE αE = joules for same operations

    ndash δE = joules per word of memory used per sec

    ndash εE = joules per sec for leakage etc

    bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

    Perfect Strong Scaling ndash in Time and Energy (22)

    bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

    bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

    achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

    Handling Heterogeneitybull Suppose each of P processors could differ

    ndash γi = secflop βi = secword αi = secmessage Mi = memory

    bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

    12 + Fi αi Mi32 = Fi [γi + βi Mi

    12 + αi Mi32] = Fi ξi

    ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

    ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

    bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

    bull Works for Strassen other algorithmshellip

    Application to Tensor Contractions

    bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

    bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

    bull Heavily used in electronic structure calculationsndash Ex NWChem

    bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

    ndash Solomonik Hammond Matthews

    C(ijk) = Σm A(ijm)B(mk)

    A3-fold symm

    B2-fold symm

    C2-fold symm

    Application to Tensor Contractions

    bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

    bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

    bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

    bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

    Communication Lower Bounds for Strassen-like matmul algorithms

    bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

    bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

    ndash words_moved = Ω (flopsM^(logmpq -1))

    bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

    Classical O(n3) matmul

    words_moved =Ω (M(nM12)3P)

    Strassenrsquos O(nlg7) matmul

    words_moved =Ω (M(nM12)lg7P)

    Strassen-like O(nω) matmul

    words_moved =Ω (M(nM12)ωP)

    vs

    Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

    Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

    CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

    Communication Avoiding Parallel Strassen (CAPS)

    Best way to interleaveBFS and DFS is an tuning parameter

    26

    Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

    Speedups 24-184(over previous Strassen-based algorithms)

    Invited to appear as Research Highlight in CACM

    Strassen-like beyond matmul

    bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

    bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

    Ballard D Holtz Schwartz

    Cache and Network Oblivious Algorithms

    bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

    bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

    bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

    dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

    CARMA Performance Distributed Memory

    Square m = k = n = 6144

    ScaLAPACK

    CARMA

    Peak

    (log)

    (log)

    Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

    CARMA Performance Distributed Memory

    Inner Product m = n = 192 k = 6291456

    ScaLAPACK

    CARMAPeak

    (log)

    (log)

    Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

    CARMA Performance Shared Memory

    Square m = k = n

    MKL (double)CARMA (double)

    MKL (single)CARMA (single)

    Peak (single)

    Peak (double)

    (log)

    (linear)

    Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

    CARMA Performance Shared Memory

    Inner Product m = n = 64

    MKL (double)

    CARMA (double)

    MKL (single)

    CARMA (single)

    (log)

    (linear)

    Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

Shared-memory inner product (m = n = 64, k = 524,288)

[Plot, linear scale: CARMA incurs 97% and 86% fewer L3 misses than MKL in the two precisions]

Outline (repeated) – next: one-sided factorizations, LU & QR with tournament pivoting

One-sided Factorizations (LU, QR), so far

• Classical approach:
      for i = 1 to n
          update column i
          update trailing matrix
  – #words_moved = O(n^3)

• Blocked approach (LAPACK):
      for i = 1 to n/b
          update block i of b columns
          update trailing matrix
  – #words_moved = O(n^3 / M^(1/3))

• Recursive approach (a numpy sketch follows below):
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              update right half of A
              factor(right half of A)
  – #words_moved = O(n^3 / M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: partial pivoting => n reductions
• Need another idea
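A minimal numpy sketch of the recursive approach (no pivoting, so it assumes well-conditioned leading minors; it shows the shape of the recursion, not the data movement):

    import numpy as np

    def recursive_lu(A):
        # Factor A = L*U in place; L unit lower triangular (sketch, no pivoting)
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]                      # 1 column: just update it
            return A
        h = n // 2
        recursive_lu(A[:, :h])                       # factor left half of A
        L11 = np.tril(A[:h, :h], -1) + np.eye(h)
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])  # update right half: U12
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]           # Schur complement update
        recursive_lu(A[h:, h:])                      # factor right half of A
        return A

    n = 8
    A0 = np.random.randn(n, n) + n * np.eye(n)       # diagonally dominant: safe
    A = A0.copy(); recursive_lu(A)
    L, U = np.tril(A, -1) + np.eye(n), np.triu(A)
    assert np.allclose(L @ U, A0)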

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3]  (tall-skinny, split into row blocks)

Parallel (binary tree):
    each Wi -> R_i0 in parallel (R00, R10, R20, R30);
    combine pairs: [R00; R10] -> R01, [R20; R30] -> R11;
    combine: [R01; R11] -> R02

Sequential/Streaming (flat tree):
    W0 -> R00; [R00; W1] -> R01; [R01; W2] -> R02; [R02; W3] -> R03

Dual core (hybrid tree): two flat-tree chains, combined pairwise at the end

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
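A numpy sketch of the parallel (binary-tree) variant, run sequentially: local QRs at the leaves, then pairwise combines of stacked R factors. (Real TSQR also keeps the implicit Q factors at each node so Q can be applied later.)

    import numpy as np

    def tsqr_r(W, nblocks=4):
        # R factor of tall-skinny W via a binary reduction tree (sketch)
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
        while len(Rs) > 1:                          # one tree level per pass
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.randn(10000, 50)
    # R is unique up to the signs of its rows:
    assert np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode='r')))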

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

Round 1: factor each block, Wi = Pi·Li·Ui, and choose b pivot rows of Wi; call them Wi'
Round 2: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, call them W12' and W34'
Round 3: [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
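A toy numpy sketch of the tournament (flat, two rounds; real TSLU uses the reduction trees above and keeps the factors). Here gepp_pivot_rows plays the role of "choose b pivot rows" by running ordinary partial pivoting on a block:

    import numpy as np

    def gepp_pivot_rows(W, b):
        # indices of the b pivot rows GEPP would pick on W (selection only)
        A, rows = W.astype(float).copy(), np.arange(W.shape[0])
        for k in range(b):
            p = k + np.argmax(np.abs(A[k:, k]))      # largest entry in column k
            A[[k, p]], rows[[k, p]] = A[[p, k]], rows[[p, k]]
            A[k + 1:, k:] -= np.outer(A[k + 1:, k] / A[k, k], A[k, k:])
        return rows[:b]

    def tournament_pivot_rows(W, b, nblocks=4):
        idx_blocks = np.array_split(np.arange(W.shape[0]), nblocks)
        # round 1: each block Wi elects b "winner" rows Wi'
        winners = np.concatenate([idx[gepp_pivot_rows(W[idx], b)]
                                  for idx in idx_blocks])
        # final round: play off the winners to pick the b pivot rows of W
        return winners[gepp_pivot_rows(W[winners], b)]

    W = np.random.randn(1024, 8)
    piv = tournament_pivot_rows(W, b=8)   # move these rows to the top of W,
                                          # then do LU without pivoting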


Minimizing Communication in TSLU

Same reduction trees as TSQR, with LU at each node:

Parallel (binary tree):    LU on each Wi, then pairwise LU combines up the tree
Sequential/Streaming:      flat tree of LUs, folding in one block at a time
Dual core:                 hybrid of the two

Can choose reduction tree dynamically, to match architecture, as before


Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?


Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: ||L − Lnp|| often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as upper triangular factor
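The fallback logic, as a sketch; the thresholds and the tslu/tsqr callables are hypothetical stand-ins for the fast kernels, not a real API:

    import numpy as np

    def lu_with_fallback(A, tslu, tsqr, cond_max=1e12, growth_max=1e8):
        P, L, U = tslu(A)                    # fast path
        if np.linalg.cond(U) < cond_max and np.abs(L).max() < growth_max:
            return P, L, U                   # usual case: stability tests pass
        Q, R = tsqr(A)                       # rare case: factor A = Q*R stably,
        P, L, U1 = tslu(Q)                   # then Q = P*L*U1,
        return P, L, U1 @ R                  # so A = P*L*(U1*R)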

• Last topic in lecture: how to guarantee floating point reproducibility


2D CALU with Tournament Pivoting

2.5D CALU with Tournament Pivoting (c = 4 copies)

Exascale Machine Parameters (source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heatmap: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Figure: band matrix, zero outside the band]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

Without shape morphing:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  • Words = O(n^3 / M^(1/2))
  • Messages = O(n^3 / M)

With shape morphing (SMLU):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  • Words = O(n^3 / M^(1/2))
  • Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use tournament pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline (repeated) – next: sparse matrices

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
        for i = 1:n
            for j = 1:n
                D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A ⊗ B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21
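Both pieces of pseudocode drop straight into numpy once ⊗ is spelled out. This sketch checks Kleene's recursion against the triple loop (dense and sequential; the 2.5D data distribution is the part elided here):

    import numpy as np

    def semiring_update(D, A, B):
        # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) ): the "D = A (x) B" above
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D = D.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = semiring_update(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = semiring_update(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = semiring_update(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = semiring_update(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = semiring_update(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = semiring_update(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    n = 32
    A = np.where(np.random.rand(n, n) < 0.3, np.random.rand(n, n), np.inf)
    np.fill_diagonal(A, 0.0)
    D = A.copy()                              # Floyd-Warshall reference
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    assert np.allclose(dc_apsp(A), D)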

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores), annotated with a 6.2x speedup and a 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid; w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; need a new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline (repeated) – next: eigenproblems (symmetric and nonsymmetric)

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-avoiding approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure, animated across several slides: bulge-chasing on a band matrix. Step 1: Q1 zeroes out d diagonals in a block of c columns just below the band; applying Q1^T on the other side creates a (d+c)-by-(d+c) bulge further down the band. Steps 2–6: Q2, Q3, Q4, Q5, … chase each bulge down and off the matrix, sweep by sweep.]

Conventional vs CA-SBR

Conventional: touch all data 4 times.  Communication-avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: spectral divide-and-conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

[Table: which algorithms attain the sequential lower bounds on #words and #messages, for a two-level memory and for a full memory hierarchy]
• BLAS-3 – all four bounds: [FLPR'99][BDLST'13][MKL etc.]
• Cholesky – words (two-level): [G'97][AP'00][LAPACK][BDHS'09]; messages and hierarchy: [G'97][AP'00][BDHS'09]
• Sym Indefinite – two-level words & messages: [BBDDDPSTY'13]
• LU – words: [G'97][T'97][GDX'11][BDLST'13]; messages: [GDX'11][BDLST'13]; hierarchy: [G'97][T'97][BDLST'13] / [BDLST'13]
• QR – words: [EG'98][FW'03][DGHL'12][BDLST'13]; messages: [FW'03][DGHL'12][BDLST'13]; hierarchy: [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
• Rank Revealing QR – [BDD'11][DGGX'13]
• Sym Eig & SVD – words: [BDD'11][BDK'13]; messages: [BDD'11]
• Non Sym Eig – [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

[Table: which algorithms attain the parallel bounds on #words (BW) and #messages (L), and the saving factor attainable with extra memory, i.e. 2.5D with M = Θ(c·n^2/P)]
• BLAS-3 – [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving – L: n/P^(1/2)
• Cholesky – [ScaLAPACK][T'99][SD'11]; saving – L: n/P^(1/2)
• Sym Indefinite – words: [BBDDDPSTY'13][ScaLAPACK]; messages: [BBDDDPSTY'13]; saving – L: n/P^(1/2)
• LU – words: [ScaLAPACK][GDX'11][T'99][SD'11]; messages: [GDX'11][T'99][SD'11]; saving – L: n/P^(1/2)
• QR – words: [ScaLAPACK][DGHL'12][T'99]; messages: [DGHL'12][T'99]; saving – L: n/P^(1/2)
• Rank Revealing QR – [BDD'11][DGGX'13]
• Sym Eig & SVD – words: [BDD'11][BDK'13][ScaLAPACK]; messages: [BDD'11][BDK'13]; saving – L: n/P^(1/2)
• Non-Sym Eig – [BDD'11]; saving – BW: P^(1/2), L: n

Outline (repeated) – next: iterative linear algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

Outline (repeated) – next: autotuning Sparse-Matrix-Vector-Multiply (SpMV)

Example: The Difficulty of Tuning SpMV

• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix; an 8x8 dense substructure is visible – exploit this to limit #mem_refs]

Speedups on Itanium 2: The Need for Search

[Plot: SpMV performance (Mflops) over all register block sizes; the reference code and the best block size (4x2) are marked]

Register Profile: Itanium 2

[Plot: performance ranges from 190 Mflops (worst block size) to 1190 Mflops (best)]

Register Profiles: IBM and Intel IA-64

[Plots: best block size as a fraction of peak on four machines – Power3: 17% (122 to 252 Mflops), Power4: 16% (459 to 820 Mflops), Itanium 1: 8% (107 to 247 Mflops), Itanium 2: 33% (190 Mflops to 1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M

[Figure: spy plot of the matrix; zooming in to the top corner shows the detailed structure]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher
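The fill ratio is easy to measure with scipy (a sketch; autotuners like OSKI estimate it by sampling instead of converting the whole matrix):

    import scipy.sparse as sp

    def fill_ratio(A_csr, r, c):
        # stored values per true nonzero after r-by-c register blocking
        B = sp.bsr_matrix(A_csr, blocksize=(r, c))  # pads blocks with explicit zeros
        return (B.data.shape[0] * r * c) / A_csr.nnz

    A = sp.random(2100, 2100, density=0.003, format='csr')
    print(fill_ratio(A, 3, 3))   # blocking pays off only if the blocked kernel
                                 # is at least this many times faster per value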

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot; 100x100 submatrix along diagonal]

Post-RCM Reordering

[Figure: spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before: green + red; after: green + blue]

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline (repeated) – next: reorganizing Krylov methods – Conjugate Gradients

Example: Classical Conjugate Gradient (CG)

[Algorithm: standard CG; the SpMV and the dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm: s-step CA-CG; the Krylov basis is built via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
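The kernel's job is to produce the s-step Krylov basis. Naively that takes s dependent SpMVs, which is exactly the communication the CA matrix powers kernel avoids by replicating boundary data; a sketch of the naive version, with the monomial basis:

    import numpy as np
    import scipy.sparse as sp

    def monomial_basis(A, x, s):
        # V = [x, A@x, ..., (A^s)@x]; the CA matrix powers kernel computes the
        # same columns in O(1) rounds of communication instead of s
        V = np.empty((x.size, s + 1))
        V[:, 0] = x
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        return V

    n = 900                                   # e.g. a 30x30 Poisson-like grid
    A = sp.diags([-1, -1, 4, -1, -1], [-30, -1, 0, 1, 30],
                 shape=(n, n), format='csr')
    V = monomial_basis(A, np.random.randn(n), s=8)
    print(np.linalg.cond(V))                  # grows rapidly with s -- the
                                              # stability challenge shown next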

Outline (repeated) – next: stability challenges and approaches

[Plot: convergence of CG vs CA-CG (monomial basis). Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy relative to machine precision due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down]

Outline (repeated) – next: what is a "sparse matrix"?

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                        Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit:     CSR and variations          Vision, climate, AMR, …
  Entries implicit:     Graph Laplacian             Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
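For example, the sparse-plus-low-rank case supports a fast matvec without ever forming the dense matrix (a sketch with made-up sizes):

    import numpy as np
    import scipy.sparse as sp

    n, k = 10000, 5
    S = sp.random(n, n, density=1e-4, format='csr')  # sparse part
    U, V = np.random.randn(n, k), np.random.randn(n, k)
    D = np.diag(np.random.randn(k))                  # small & square
    x = np.random.randn(n)

    y = S @ x + U @ (D @ (V.T @ x))   # O(nnz(S) + n*k) work, never O(n^2)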

Outline (repeated) – next: floating-point reproducibility

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible.]

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.)
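Two of these points are easy to see in a few lines of Python: nonassociativity is why the answer moves, and a fixed reduction tree restores bitwise determinism for a fixed input length, at the cost of goals 2 and 3. The prerounding technique is what achieves all four; this sketch shows only the first two points:

    import numpy as np

    x = np.random.randn(10**6) * np.logspace(-10, 10, 10**6)
    print(float(np.sum(x)) - float(np.sum(x[::-1])))  # usually nonzero:
                                                      # fp addition isn't associative

    def tree_sum(v):
        # fixed pairwise reduction tree: bitwise-identical result for a given
        # input, however the work is later split across threads (sketch)
        v = np.asarray(v, dtype=np.float64).copy()
        while v.size > 1:
            if v.size % 2:
                v = np.append(v, 0.0)   # pad so the pairing is deterministic
            v = v[0::2] + v[1::2]       # one level of the reduction tree
        return v[0]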


Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

    • Implementing Communication-Avoiding Algorithms
    • Why avoid communication
    • Goals
    • Outline
    • Outline (2)
    • Lower bound for all ldquon3-likerdquo linear algebra
    • Lower bound for all ldquon3-likerdquo linear algebra (2)
    • Lower bound for all ldquon3-likerdquo linear algebra (3)
    • Limits to parallel scaling (12)
    • Limits to parallel scaling (22)
    • Can we attain these lower bounds
    • Outline (3)
    • 25D Matrix Multiplication
    • 25D Matrix Multiplication (2)
    • 25D Matmul on BGP 16K nodes 64K cores (2)
    • Perfect Strong Scaling ndash in Time and Energy (12)
    • Perfect Strong Scaling ndash in Time and Energy (22)
    • Handling Heterogeneity
    • Application to Tensor Contractions
    • C(ijk) = Σm A(ijm)B(mk)
    • Application to Tensor Contractions (2)
    • Communication Lower Bounds for Strassen-like matmul algorithms
    • vs
    • Slide 26
    • Strassen-like beyond matmul
    • Cache and Network Oblivious Algorithms
    • CARMA Performance Distributed Memory
    • CARMA Performance Distributed Memory (2)
    • CARMA Performance Shared Memory
    • CARMA Performance Shared Memory (2)
    • Why is CARMA Faster in Shared Memory
    • Outline (4)
    • One-sided Factorizations (LU QR) so far
    • TSQR An Architecture-Dependent Algorithm
    • Back to LU Using similar idea for TSLU as TSQR Use reduction
    • Minimizing Communication in TSLU
    • Making TSLU Numerically Stable
    • Stability of LU using TSLU CALU
    • Why is stability of TSLU just a ldquoThmrdquo
    • Fixing TSLU
    • 2D CALU with Tournament Pivoting
    • 25D CALU with Tournament Pivoting (c=4 copies)
    • Exascale Machine Parameters Source DOE Exascale Workshop
    • Exascale predicted speedups for Gaussian Elimination 2D CA
    • 25D vs 2D LU With and Without Pivoting
    • Other CA algorithms for Ax=b least squares(13)
    • Other CA algorithms for Ax=b least squares (23)
    • Other CA algorithms for Ax=b least squares (33)
    • Outline (5)
    • What about sparse matrices (13)
    • Performance of 25D APSP using Kleene
    • What about sparse matrices (23)
    • What about sparse matrices (33)
    • Outline (6)
    • Symmetric Eigenproblem and SVD
    • Slide 58
    • Slide 59
    • Slide 60
    • Slide 61
    • Slide 62
    • Slide 63
    • Slide 64
    • Slide 65
    • Slide 66
    • Slide 67
    • Slide 68
    • Conventional vs CA - SBR
    • Speedups of Sym Band Reduction vs DSBTRD
    • Nonsymmetric Eigenproblem
    • Attaining the Lower bounds Sequential
    • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
    • Outline (7)
    • Avoiding Communication in Iterative Linear Algebra
    • Outline (8)
    • Example The Difficulty of Tuning SpMV
    • Example The Difficulty of Tuning
    • Speedups on Itanium 2 The Need for Search
    • Register Profile Itanium 2
    • Register Profiles IBM and Intel IA-64
    • Another example of tuning challenges for SpMV
    • Zoom in to top corner
    • 3x3 blocks look natural buthellip
    • Extra Work Can Improve Efficiency
    • Slide 86
    • Slide 87
    • Slide 88
    • Slide 89
    • Summary of Other Performance Optimizations
    • Optimized Sparse Kernel Interface - OSKI
    • Outline (9)
    • Example Classical Conjugate Gradient (CG)
    • Example CA-Conjugate Gradient
    • Outline (10)
    • Slide 96
    • Slide 97
    • Outline (11)
    • What is a ldquosparse matrixrdquo
    • Outline (12)
    • Reproducible Floating Point Computation
    • Intel MKL non-reproducibility
    • GoalsApproaches for Reproducibility
    • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
    • Collaborators and Supporters
    • Summary

      Goals

      3

      bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

      bull L1 L2 DRAM network etc bull Attain lower bounds if possible

      bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

      bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

      bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

      Lower bound for all ldquon3-likerdquo linear algebra

      bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

      matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

      6

      bull Let M = ldquofastrdquo memory size (per processor)

      words_moved (per processor) = (flops (per processor) M12 )

      messages_sent (per processor) = (flops (per processor) M32 )

      bull Parallel case assume either load or memory balanced

      Lower bound for all ldquon3-likerdquo linear algebra

      bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

      matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

      7

      bull Let M = ldquofastrdquo memory size (per processor)

      words_moved (per processor) = (flops (per processor) M12 )

      messages_sent ge words_moved largest_message_size

      bull Parallel case assume either load or memory balanced

      Lower bound for all ldquon3-likerdquo linear algebra

      bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

      matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

      8

      bull Let M = ldquofastrdquo memory size (per processor)

      words_moved (per processor) = (flops (per processor) M12 )

      messages_sent (per processor) = (flops (per processor) M32 )

      bull Parallel case assume either load or memory balanced

      SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

      Limits to parallel scaling (12)bull Consider dense case flops_per_proc = n3P

      ndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

      bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

      bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

      bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

      bull How big can we make P and M

      Limits to parallel scaling (22)

      bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

      bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

      ndash Otherwise no communication may be neededbull Thm Words= (n2P23 ) independent of M

      bull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

      Can we attain these lower bounds

      bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

      bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

      new ways to encode answers new data structures

      ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

      ndash Algorithms Energy Heterogeneous Processors hellip11

      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

      bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

      25D Matrix Multiplication

      bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

      c

      (Pc)12

      (Pc)12

      Example P = 32 c = 2

      25D Matrix Multiplication

      bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

      k

      j

      iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

      (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

      (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

      25D Matmul on BGP 16K nodes 64K coresc = 16 copies

      Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

      12x faster

      27x faster

      Perfect Strong Scaling ndash in Time and Energy (12)

      bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

      bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

      ndash γT βT αT = secs per flop per word_moved per message of size m

      bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

      ndash γE βE αE = joules for same operations

      ndash δE = joules per word of memory used per sec

      ndash εE = joules per sec for leakage etc

      bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

      Perfect Strong Scaling ndash in Time and Energy (22)

      bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

      bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

      achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

      Handling Heterogeneitybull Suppose each of P processors could differ

      ndash γi = secflop βi = secword αi = secmessage Mi = memory

      bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

      12 + Fi αi Mi32 = Fi [γi + βi Mi

      12 + αi Mi32] = Fi ξi

      ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

      ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

      bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

      bull Works for Strassen other algorithmshellip

      Application to Tensor Contractions

      bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

      bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

      bull Heavily used in electronic structure calculationsndash Ex NWChem

      bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

      ndash Solomonik Hammond Matthews

      C(ijk) = Σm A(ijm)B(mk)

      A3-fold symm

      B2-fold symm

      C2-fold symm

      Application to Tensor Contractions

      bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

      bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

      bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

      bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

      Communication Lower Bounds for Strassen-like matmul algorithms

      bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

      bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

      ndash words_moved = Ω (flopsM^(logmpq -1))

      bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

      Classical O(n3) matmul

      words_moved =Ω (M(nM12)3P)

      Strassenrsquos O(nlg7) matmul

      words_moved =Ω (M(nM12)lg7P)

      Strassen-like O(nω) matmul

      words_moved =Ω (M(nM12)ωP)

      vs

      Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

      Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

      CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

      Communication Avoiding Parallel Strassen (CAPS)

      Best way to interleaveBFS and DFS is an tuning parameter

      26

      Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

      Speedups 24-184(over previous Strassen-based algorithms)

      Invited to appear as Research Highlight in CACM

      Strassen-like beyond matmul

      bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

      bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

      Ballard D Holtz Schwartz

      Cache and Network Oblivious Algorithms

      bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

      bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

      bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

      dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

      CARMA Performance Distributed Memory

      Square m = k = n = 6144

      ScaLAPACK

      CARMA

      Peak

      (log)

      (log)

      Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

      CARMA Performance Distributed Memory

      Inner Product m = n = 192 k = 6291456

      ScaLAPACK

      CARMAPeak

      (log)

      (log)

      Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

      CARMA Performance Shared Memory

      Square m = k = n

      MKL (double)CARMA (double)

      MKL (single)CARMA (single)

      Peak (single)

      Peak (double)

      (log)

      (linear)

      Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

      CARMA Performance Shared Memory

      Inner Product m = n = 64

      MKL (double)

      CARMA (double)

      MKL (single)

      CARMA (single)

      (log)

      (linear)

      Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

      Why is CARMA Faster in Shared MemoryL3 Cache Misses

      Shared Memory Inner Product (m = n = 64 k = 524288)

      97 Fewer Misses

      86 Fewer Misses

      (linear)

      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

      bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

      One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

      35

      bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

      bull Recursive Approach func factor(A) if A has 1 column update it

      else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

      bull None of these approaches minimizes messagesbull Parallel case Partial

      Pivoting =gt n reductionsbull Need another idea

      TSQR An Architecture-Dependent Algorithm

      W =

      W0

      W1

      W2

      W3

      R00

      R10

      R20

      R30

      R01

      R11

      R02Parallel

      W =

      W0

      W1

      W2

      W3

      R01 R02

      R00

      R03

      SequentialStreaming

      W =

      W0

      W1

      W2

      W3

      R00

      R01R01

      R11

      R02

      R11

      R03

      Dual Core

      Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

      Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

      Wnxb =

      W1

      W2

      W3

      W4

      P1middotL1middotU1

      P2middotL2middotU2

      P3middotL3middotU3

      P4middotL4middotU4

      =

      Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

      W1rsquoW2rsquoW3rsquoW4rsquo

      P12middotL12middotU12

      P34middotL34middotU34

      =Choose b pivot rows call them W12rsquo

      Choose b pivot rows call them W34rsquo

      W12rsquoW34rsquo

      = P1234middotL1234middotU1234

      Choose b pivot rows

      Go back to W and use these b pivot rows (move them to top do LU without pivoting)

      37

      Minimizing Communication in TSLU

      W = W1

      W2

      W3

      W4

      LULULULU

      LU

      LULUParallel

      W = W1

      W2

      W3

      W4

      LULU

      LU

      LUSequentialStreaming

      W = W1

      W2

      W3

      W4

      LULU LU

      LULU

      LULU

      Dual Core

      Can choose reduction tree dynamically to match architecture as before

      38

      Making TSLU Numerically Stable

      bull Details matterndash Going up the tree we could do LU either on original rows of A

      (tournament pivoting) or computed rows of Undash Only tournament pivoting stable

      bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

      bull Why just a ldquoThmrdquo

      39

      Stability of LU using TSLU CALU

      Summer School Lecture 4 40

      bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

      Why is stability of TSLU just a ldquoThmrdquo

      bull Proof is correct ndash in exact arithmeticbull Experiment

      ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

      they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

      ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

      ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

      ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

      bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

      panel in symmetric-indefinite factorization 41

      Fixing TSLU

      bull Run TSLU quickly test for stability fix if necessary (rare)

      bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

      bull Last topic in lecture how to guarantee floating point reproducibility

      42

      2D CALU with Tournament Pivoting

      43

      25D CALU with Tournament Pivoting (c=4 copies)

      44

      Exascale Machine ParametersSource DOE Exascale Workshop

      bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

      Exascale predicted speedupsfor Gaussian Elimination

      2D CA-LU vs ScaLAPACK-LU

      log2 (p)

      log 2

      (n2 p

      ) =

      log 2

      (mem

      ory_

      per_

      proc

      )

      Up to 29x

      25D vs 2D LUWith and Without Pivoting

      Other CA algorithms for Ax=b least squares(13)

      bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

      ldquosimplerdquobull Save frac12 flops preserve inertia

      ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

      ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

      ndash PAPT = LTLT where T is banded using TSLU

      48

      0 0

      0

      0 0

      0

      0

      hellip

      hellip

      ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

      Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

      ndash So far could not do partial pivoting and minimize messages just words

      ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

      ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

      49

      bull func factor(A) if A has 1 column update it else factor(left half of A)

      update right half of A

      factor(right half of A)

      bull Words = O(n3M12)

      bull Messages = O(n3M)

      bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

      bull Words = O(n3M12)

      bull Messages = O(n3M32)

      Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

      ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

      ndash Usual approach like Partial Pivotingbull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

      ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

      groups of b columns either using usual approach or something better (GuEisenstat)

      bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

      ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

      bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

      What about sparse matrices (13)

      bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

      bull But canrsquot reorder outer loop for 25D need another ideabull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = AB

      ndash Dependencies ok 25D works just different semiringbull Kleenersquos Algorithm

      52

      for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

      D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

      Performance of 25D APSP using Kleene

      53

      Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

      62xspeedup

      2x speedup

      What about sparse matrices (23)

      bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of G(A)

      have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

      bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering for

      2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

      bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

      multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

      separators)

      54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

      Words_moved = Ω(min( d·n / P^(1/2), d^2·n / P ))

  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n / (P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure: animation of successive band reduction on a band of half-width b+1 — sweeps Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T annihilate c columns (d diagonals) at a time, each sweep (steps 1-6) chasing the (d+c)-sized bulge created by the previous one down the band.]

Conventional vs CA-SBR

Conventional: touch all data 4 times.    Communication-Avoiding: touch all data once.

[Animations contrasting the two sweep schedules]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(For each kernel: citations attaining the Words and Messages bounds, for a two-level memory and for a full memory hierarchy)

• BLAS-3:            [FLPR'99][BDLST'13][MKL etc.]  |  [FLPR'99][BDLST'13][MKL etc.]
• Cholesky:          [G'97][AP'00][LAPACK][BDHS'09]  |  [G'97][AP'00][BDHS'09]  |  [G'97][AP'00][BDHS'09]
• Sym. Indefinite:   [BBDDDPSTY'13]  |  [BBDDDPSTY'13]
• LU:                [G'97][T'97][GDX'11][BDLST'13]  |  [GDX'11][BDLST'13]  |  [G'97][T'97][BDLST'13]  |  [BDLST'13]
• QR:                [EG'98][FW'03][DGHL'12][BDLST'13]  |  [FW'03][DGHL'12][BDLST'13]  |  [EG'98][FW'03][BDLST'13]  |  [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD:    [BDD'11][BDK'13]  |  [BDD'11]
• Non-Sym. Eig:      [BDD'11]  |  [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2 / P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
(For each kernel: citations attaining the Words (BW) and Messages (L) bounds, and the saving factor)

• BLAS-3:            [AGZ'94][MT'99][ScaLAPACK]  |  [C'69][vGW'97][SD'11]  |  L: n/P^(1/2)
• Cholesky:          [ScaLAPACK]  |  [T'99][SD'11]  |  L: n/P^(1/2)
• Sym. Indefinite:   [BBDDDPSTY'13][ScaLAPACK]  |  [BBDDDPSTY'13]  |  L: n/P^(1/2)
• LU:                [ScaLAPACK][GDX'11][T'99][SD'11]  |  [GDX'11][T'99][SD'11]  |  L: n/P^(1/2)
• QR:                [ScaLAPACK][DGHL'12][T'99]  |  [DGHL'12][T'99]  |  L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD:    [BDD'11][BDK'13][ScaLAPACK]  |  [BDD'11][BDK'13]  |  L: n/P^(1/2)
• Non-Sym. Eig:      [BDD'11]  |  [BDD'11]  |  BW: P^(1/2), L: n

• Attaining with extra memory, 2.5D: M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the 1D sketch below)
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
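As a toy illustration of the matrix powers kernel behind the "New" counts (our own 1D example, not the general implementation): for the tridiagonal stencil (A·v)(i) = 2v(i) − v(i−1) − v(i+1), a processor that fetches k ghost values from each neighbor once can compute its pieces of A·v, …, A^k·v with no further communication, paying some redundant flops in the shrinking halo.

    import numpy as np

    def local_matrix_powers(v_local, left_ghosts, right_ghosts, k):
        # One processor's part of the matrix powers kernel for the 1D
        # stencil (A v)(i) = 2 v(i) - v(i-1) - v(i+1) (interior processor;
        # domain-boundary handling omitted). Ghosts are fetched once.
        x = np.concatenate([left_ghosts, v_local, right_ghosts])  # length n + 2k
        pieces = []
        for j in range(1, k + 1):
            x = 2 * x[1:-1] - x[:-2] - x[2:]   # one sweep; valid halo shrinks by 1 per side
            m = k - j
            pieces.append(x[m:len(x) - m])     # this processor's piece of A^j v
        return pieces

    # check the j = 1 piece against a direct global sweep
    n, k = 8, 3
    g = np.random.rand(n + 2 * k)              # neighbors' ghosts + local block, as one slice
    sweep = lambda y: 2 * y[1:-1] - y[:-2] - y[2:]
    pieces = local_matrix_powers(g[k:-k], g[:k], g[-k:], k)
    assert np.allclose(pieces[0], sweep(g)[k - 1:k - 1 + n])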

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix]

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Figure: zoomed spy plot showing the 8x8 blocks]

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heat map in Mflops — the Reference implementation vs the Best block size (4x2)]

Register Profile: Itanium 2

[Figure: performance across register block sizes, from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: four register-blocking profiles, panel annotations "Power3 - 17", "Power4 - 16", "Itanium 2 - 33", "Itanium 1 - 8"; performance ranges: Power3 122 to 252 Mflops, Power4 459 to 820 Mflops, Itanium 1 107 to 247 Mflops, Itanium 2 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: spy plot of the matrix]

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: zoomed spy plot]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher (see the sketch below)
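A hedged sketch of the same trade-off, using SciPy's BSR (block CSR) format in place of an OSKI-style blocked kernel; the random matrix here is a stand-in (not raefsky), chosen only to make the fill ratio measurable.

    import numpy as np
    from scipy.sparse import random as sparse_random

    # 3x3 register blocking: zeros stored inside each 3x3 block are
    # multiplied explicitly -- that is the "fill-in" extra work.
    A = sparse_random(21000, 21000, density=3e-5, format='csr', random_state=0)
    A_bsr = A.tobsr(blocksize=(3, 3))

    # .nnz counts stored values including explicit zeros, giving the fill ratio.
    fill = A_bsr.nnz / A.nnz
    print(f"fill ratio = {fill:.2f}")   # near 9 here: random scatter blocks badly;
                                        # raefsky's natural 8x8 substructure is what
                                        # makes blocking pay off on that matrix

    x = np.random.rand(A.shape[1])
    assert np.allclose(A @ x, A_bsr @ x)   # same y = A*x, different inner loop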

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the matrix]

100x100 Submatrix Along Diagonal

[Figure: zoomed spy plot]

Post-RCM Reordering

[Figure: spy plot after Reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering

[Figure: Before = Green + Red; After = Green + Blue]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm listing: the SpMVs and dot products require communication in each iteration — a runnable sketch follows below]

Example: CA-Conjugate Gradient

[Algorithm listing: the k SpMVs are done via the CA Matrix Powers Kernel; one global reduction computes the Gram matrix G; the local computations within the inner loop require no communication]
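For reference, the classical method in a dozen lines of Python (a sketch; the slide's listings are images in the original deck), with the per-iteration communication points marked in comments:

    import numpy as np

    def cg(A, b, x, iters):
        # Textbook CG; comments mark where a parallel run communicates.
        # CA-CG replaces s such iterations by one matrix-powers-kernel call
        # plus a single global reduction that forms the Gram matrix G.
        r = b - A @ x              # SpMV: neighbor/ghost-zone communication
        p = r.copy()
        rr = r @ r                 # dot product: global reduction
        for _ in range(iters):
            Ap = A @ p             # SpMV: neighbor communication, every iteration
            alpha = rr / (p @ Ap)  # dot product: global reduction
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r         # dot product: global reduction
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 100                        # 1D Poisson test problem
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b, np.zeros(n), 200)
    assert np.linalg.norm(A @ x - b) < 1e-8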

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down before reaching machine precision.]
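The breakdown is easy to reproduce; a hedged sketch on the same model problem, measuring the conditioning of the normalized monomial basis [p, Ap, …, A^s p]:

    import numpy as np

    n = 30                                               # 30x30 grid, cond(A) ~ 400
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))    # 2D Poisson, 5-point stencil
    print(f"cond(A) = {np.linalg.cond(A):.0f}")

    rng = np.random.default_rng(0)
    v = rng.random(n * n)
    V = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))                  # normalized A^s p
        print(s, f"{np.linalg.cond(np.column_stack(V)):.2e}")
    # The condition number grows geometrically with s; per the plot above,
    # by s = 16 the basis is numerically rank deficient in double precision.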

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                Indices explicit (O(nnz))    Indices implicit (o(nnz))
    Entries explicit (O(nnz)):  CSR and variations           Vision, climate, AMR, …
    Entries implicit (o(nnz)):  Graph Laplacian              Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as sum of S^k and low-rank matrices (expanded for k = 2 below)
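To see why such a decomposition exists, a one-line check for k = 2:

    A^2 = (S + U·D·V^T)^2
        = S^2 + (S·U·D)·V^T + U·(D·V^T·S) + U·(D·V^T·U·D)·V^T

The last three terms each have rank at most r = rank(D), so A^2 = S^2 + (a matrix of rank ≤ 3r); higher powers unfold the same way, with only small factors such as D·V^T·S touching the sparse part.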

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors — the sign is not reproducible]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.) – see the sketch below
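A much-simplified, single-bin sketch of the pre-rounding idea (the production algorithm uses several bins to retain accuracy and handles scaling more carefully; the function name and constants here are ours):

    import math

    def reproducible_sum(x):
        # Snap every summand to one common power-of-two grid chosen only
        # from max|x_i| and n; the snapped values then add with no rounding
        # error, so the result is identical for ANY summation order,
        # data layout, or number of threads.
        n = len(x)
        amax = max(abs(v) for v in x)
        if amax == 0.0:
            return 0.0
        e = math.frexp(amax)[1]                           # amax = f * 2**e, 0.5 <= f < 1
        shift = math.ldexp(1.0, e + 1 + n.bit_length())   # power of two > 2*n*amax
        s = 0.0
        for v in x:
            s += (v + shift) - shift                      # v pre-rounded to ulp(shift)'s grid
        return s                                          # reproducible; error ~ n*ulp(shift)

    x = [0.1 * i for i in range(1000)]
    assert reproducible_sum(x) == reproducible_sum(list(reversed(x)))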

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

      • Implementing Communication-Avoiding Algorithms
      • Why avoid communication
      • Goals
      • Outline
      • Outline (2)
      • Lower bound for all ldquon3-likerdquo linear algebra
      • Lower bound for all ldquon3-likerdquo linear algebra (2)
      • Lower bound for all ldquon3-likerdquo linear algebra (3)
      • Limits to parallel scaling (12)
      • Limits to parallel scaling (22)
      • Can we attain these lower bounds
      • Outline (3)
      • 25D Matrix Multiplication
      • 25D Matrix Multiplication (2)
      • 25D Matmul on BGP 16K nodes 64K cores (2)
      • Perfect Strong Scaling ndash in Time and Energy (12)
      • Perfect Strong Scaling ndash in Time and Energy (22)
      • Handling Heterogeneity
      • Application to Tensor Contractions
      • C(ijk) = Σm A(ijm)B(mk)
      • Application to Tensor Contractions (2)
      • Communication Lower Bounds for Strassen-like matmul algorithms
      • vs
      • Slide 26
      • Strassen-like beyond matmul
      • Cache and Network Oblivious Algorithms
      • CARMA Performance Distributed Memory
      • CARMA Performance Distributed Memory (2)
      • CARMA Performance Shared Memory
      • CARMA Performance Shared Memory (2)
      • Why is CARMA Faster in Shared Memory
      • Outline (4)
      • One-sided Factorizations (LU QR) so far
      • TSQR An Architecture-Dependent Algorithm
      • Back to LU Using similar idea for TSLU as TSQR Use reduction
      • Minimizing Communication in TSLU
      • Making TSLU Numerically Stable
      • Stability of LU using TSLU CALU
      • Why is stability of TSLU just a ldquoThmrdquo
      • Fixing TSLU
      • 2D CALU with Tournament Pivoting
      • 25D CALU with Tournament Pivoting (c=4 copies)
      • Exascale Machine Parameters Source DOE Exascale Workshop
      • Exascale predicted speedups for Gaussian Elimination 2D CA
      • 25D vs 2D LU With and Without Pivoting
      • Other CA algorithms for Ax=b least squares(13)
      • Other CA algorithms for Ax=b least squares (23)
      • Other CA algorithms for Ax=b least squares (33)
      • Outline (5)
      • What about sparse matrices (13)
      • Performance of 25D APSP using Kleene
      • What about sparse matrices (23)
      • What about sparse matrices (33)
      • Outline (6)
      • Symmetric Eigenproblem and SVD
      • Slide 58
      • Slide 59
      • Slide 60
      • Slide 61
      • Slide 62
      • Slide 63
      • Slide 64
      • Slide 65
      • Slide 66
      • Slide 67
      • Slide 68
      • Conventional vs CA - SBR
      • Speedups of Sym Band Reduction vs DSBTRD
      • Nonsymmetric Eigenproblem
      • Attaining the Lower bounds Sequential
      • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
      • Outline (7)
      • Avoiding Communication in Iterative Linear Algebra
      • Outline (8)
      • Example The Difficulty of Tuning SpMV
      • Example The Difficulty of Tuning
      • Speedups on Itanium 2 The Need for Search
      • Register Profile Itanium 2
      • Register Profiles IBM and Intel IA-64
      • Another example of tuning challenges for SpMV
      • Zoom in to top corner
      • 3x3 blocks look natural buthellip
      • Extra Work Can Improve Efficiency
      • Slide 86
      • Slide 87
      • Slide 88
      • Slide 89
      • Summary of Other Performance Optimizations
      • Optimized Sparse Kernel Interface - OSKI
      • Outline (9)
      • Example Classical Conjugate Gradient (CG)
      • Example CA-Conjugate Gradient
      • Outline (10)
      • Slide 96
      • Slide 97
      • Outline (11)
      • What is a ldquosparse matrixrdquo
      • Outline (12)
      • Reproducible Floating Point Computation
      • Intel MKL non-reproducibility
      • GoalsApproaches for Reproducibility
      • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
      • Collaborators and Supporters
      • Summary

        Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

        ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

        ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

        bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

        bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

        Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

        ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

        ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

        bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

        bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

        Lower bound for all ldquon3-likerdquo linear algebra

        bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

        matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

        6

        bull Let M = ldquofastrdquo memory size (per processor)

        words_moved (per processor) = (flops (per processor) M12 )

        messages_sent (per processor) = (flops (per processor) M32 )

        bull Parallel case assume either load or memory balanced

        Lower bound for all ldquon3-likerdquo linear algebra

        bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

        matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

        7

        bull Let M = ldquofastrdquo memory size (per processor)

        words_moved (per processor) = (flops (per processor) M12 )

        messages_sent ge words_moved largest_message_size

        bull Parallel case assume either load or memory balanced

        Lower bound for all ldquon3-likerdquo linear algebra

        bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

        matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

        8

        bull Let M = ldquofastrdquo memory size (per processor)

        words_moved (per processor) = (flops (per processor) M12 )

        messages_sent (per processor) = (flops (per processor) M32 )

        bull Parallel case assume either load or memory balanced

        SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

        Limits to parallel scaling (12)bull Consider dense case flops_per_proc = n3P

        ndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

        bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

        bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

        bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

        bull How big can we make P and M

        Limits to parallel scaling (22)

        bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

        bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

        ndash Otherwise no communication may be neededbull Thm Words= (n2P23 ) independent of M

        bull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

        Can we attain these lower bounds

        bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

        bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

        new ways to encode answers new data structures

        ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

        ndash Algorithms Energy Heterogeneous Processors hellip11

        Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

        ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

        ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

        bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

        bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

        25D Matrix Multiplication

        bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

        c

        (Pc)12

        (Pc)12

        Example P = 32 c = 2

        25D Matrix Multiplication

        bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

        k

        j

        iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

        (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

        (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

        25D Matmul on BGP 16K nodes 64K coresc = 16 copies

        Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

        12x faster

        27x faster

        Perfect Strong Scaling ndash in Time and Energy (12)

        bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

        bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

        ndash γT βT αT = secs per flop per word_moved per message of size m

        bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

        ndash γE βE αE = joules for same operations

        ndash δE = joules per word of memory used per sec

        ndash εE = joules per sec for leakage etc

        bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

        Perfect Strong Scaling ndash in Time and Energy (22)

        bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

        bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

        achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

        Handling Heterogeneitybull Suppose each of P processors could differ

        ndash γi = secflop βi = secword αi = secmessage Mi = memory

        bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

        12 + Fi αi Mi32 = Fi [γi + βi Mi

        12 + αi Mi32] = Fi ξi

        ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

        ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

        bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

        bull Works for Strassen other algorithmshellip

        Application to Tensor Contractions

        bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

        bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

        bull Heavily used in electronic structure calculationsndash Ex NWChem

        bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

        ndash Solomonik Hammond Matthews

        C(ijk) = Σm A(ijm)B(mk)

        A3-fold symm

        B2-fold symm

        C2-fold symm

        Application to Tensor Contractions

        bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

        bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

        bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

        bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

        Communication Lower Bounds for Strassen-like matmul algorithms

        bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

        bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

        ndash words_moved = Ω (flopsM^(logmpq -1))

        bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

        Classical O(n3) matmul

        words_moved =Ω (M(nM12)3P)

        Strassenrsquos O(nlg7) matmul

        words_moved =Ω (M(nM12)lg7P)

        Strassen-like O(nω) matmul

        words_moved =Ω (M(nM12)ωP)

        vs

        Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

        Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

        CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

        Communication Avoiding Parallel Strassen (CAPS)

        Best way to interleaveBFS and DFS is an tuning parameter

        26

        Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

        Speedups 24-184(over previous Strassen-based algorithms)

        Invited to appear as Research Highlight in CACM

        Strassen-like beyond matmul

        bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

        bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

        Ballard D Holtz Schwartz

        Cache and Network Oblivious Algorithms

        bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

        bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

        bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

        dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

        CARMA Performance Distributed Memory

        Square m = k = n = 6144

        ScaLAPACK

        CARMA

        Peak

        (log)

        (log)

        Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

        CARMA Performance Distributed Memory

        Inner Product m = n = 192 k = 6291456

        ScaLAPACK

        CARMAPeak

        (log)

        (log)

        Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

        CARMA Performance Shared Memory

        Square m = k = n

        MKL (double)CARMA (double)

        MKL (single)CARMA (single)

        Peak (single)

        Peak (double)

        (log)

        (linear)

        Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

        CARMA Performance Shared Memory

        Inner Product m = n = 64

        MKL (double)

        CARMA (double)

        MKL (single)

        CARMA (single)

        (log)

        (linear)

        Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

        Why is CARMA Faster in Shared MemoryL3 Cache Misses

        Shared Memory Inner Product (m = n = 64 k = 524288)

        97 Fewer Misses

        86 Fewer Misses

        (linear)

        Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

        ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

        ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

        bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

        bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

        One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

        35

        bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

        bull Recursive Approach func factor(A) if A has 1 column update it

        else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

        bull None of these approaches minimizes messagesbull Parallel case Partial

        Pivoting =gt n reductionsbull Need another idea

        TSQR An Architecture-Dependent Algorithm

        W =

        W0

        W1

        W2

        W3

        R00

        R10

        R20

        R30

        R01

        R11

        R02Parallel

        W =

        W0

        W1

        W2

        W3

        R01 R02

        R00

        R03

        SequentialStreaming

        W =

        W0

        W1

        W2

        W3

        R00

        R01R01

        R11

        R02

        R11

        R03

        Dual Core

        Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

        Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

        Wnxb =

        W1

        W2

        W3

        W4

        P1middotL1middotU1

        P2middotL2middotU2

        P3middotL3middotU3

        P4middotL4middotU4

        =

        Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

        W1rsquoW2rsquoW3rsquoW4rsquo

        P12middotL12middotU12

        P34middotL34middotU34

        =Choose b pivot rows call them W12rsquo

        Choose b pivot rows call them W34rsquo

        W12rsquoW34rsquo

        = P1234middotL1234middotU1234

        Choose b pivot rows

        Go back to W and use these b pivot rows (move them to top do LU without pivoting)

        37

        Minimizing Communication in TSLU

        W = W1

        W2

        W3

        W4

        LULULULU

        LU

        LULUParallel

        W = W1

        W2

        W3

        W4

        LULU

        LU

        LUSequentialStreaming

        W = W1

        W2

        W3

        W4

        LULU LU

        LULU

        LULU

        Dual Core

        Can choose reduction tree dynamically to match architecture as before

        38

        Making TSLU Numerically Stable

        bull Details matterndash Going up the tree we could do LU either on original rows of A

        (tournament pivoting) or computed rows of Undash Only tournament pivoting stable

        bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

        bull Why just a ldquoThmrdquo

        39

        Stability of LU using TSLU CALU

        Summer School Lecture 4 40

        bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

        Why is stability of TSLU just a ldquoThmrdquo

        bull Proof is correct ndash in exact arithmeticbull Experiment

        ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

        they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

        ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

        ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

        ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

        bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

        panel in symmetric-indefinite factorization 41

        Fixing TSLU

        bull Run TSLU quickly test for stability fix if necessary (rare)

        bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

        bull Last topic in lecture how to guarantee floating point reproducibility

        42

        2D CALU with Tournament Pivoting

        43

        25D CALU with Tournament Pivoting (c=4 copies)

        44

        Exascale Machine ParametersSource DOE Exascale Workshop

        bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

        Exascale predicted speedupsfor Gaussian Elimination

        2D CA-LU vs ScaLAPACK-LU

        log2 (p)

        log 2

        (n2 p

        ) =

        log 2

        (mem

        ory_

        per_

        proc

        )

        Up to 29x

        25D vs 2D LUWith and Without Pivoting

        Other CA algorithms for Ax=b least squares(13)

        bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

        ldquosimplerdquobull Save frac12 flops preserve inertia

        ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

        ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

        ndash PAPT = LTLT where T is banded using TSLU

        48

        0 0

        0

        0 0

        0

        0

        hellip

        hellip

        ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

        Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

        ndash So far could not do partial pivoting and minimize messages just words

        ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

        ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

        49

        bull func factor(A) if A has 1 column update it else factor(left half of A)

        update right half of A

        factor(right half of A)

        bull Words = O(n3M12)

        bull Messages = O(n3M)

        bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

        bull Words = O(n3M12)

        bull Messages = O(n3M32)

        Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

        ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

        ndash Usual approach like Partial Pivotingbull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

        ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

        groups of b columns either using usual approach or something better (GuEisenstat)

        bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

        ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

        Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

        ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

        ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

        bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

        bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

        What about sparse matrices (13)

        bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

        bull But canrsquot reorder outer loop for 25D need another ideabull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = AB

        ndash Dependencies ok 25D works just different semiringbull Kleenersquos Algorithm

        52

        for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

        D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

        Performance of 25D APSP using Kleene

        53

        Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

        62xspeedup

        2x speedup

        What about sparse matrices (23)

        bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of G(A)

        have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

        bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering for

        2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

        bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

        multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

        separators)

        54

        What about sparse matrices (33)

        bull If matrix stays very sparse lower bound unattainable new one bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

        data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

        bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

        Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

        bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

        along dimensions most likely to minimize cost55

        Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

        ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

        ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

        bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

        bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

        Symmetric Eigenproblem and SVD

        bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

        bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

        ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

        b+1

        b+1

        Successive Band Reduction (BischofLangSun)

        1

        b+1

        b+1

        d+1

        c

        Successive Band Reduction (BischofLangSun)

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        1Q1

        b+1

        b+1

        d+1

        c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        12

        Q1

        b+1

        b+1

        d+1

        d+c

        d+c

        c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        1

        12

        Q1

        Q1T

        b+1

        b+1

        d+1

        d+1

        cd+c

        d+c

        c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        1

        1

        2

        2Q1

        Q1T

        b+1

        b+1

        d+1

        d+1

        cd+c

        d+c

        d+c

        d+c

        c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        1

        1

        2

        2

        3

        3

        Q1

        Q1T

        Q2

        Q2T

        b+1

        b+1

        d+1

        d+1

        d+c

        d+c

        d+c

        d+c

        c

        c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        1

        1

        2

        2

        3

        3

        4

        4

        Q1

        Q1T

        Q2

        Q2T

        Q3

        Q3T

        b+1

        b+1

        d+1

        d+1

        d+c

        d+c

        d+c

        d+c

        c

        c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        1

        1

        2

        2

        3

        3

        4

        4

        5

        5

        Q1

        Q1T

        Q2

        Q2T

        Q3

        Q3T

        Q4

        Q4T

        b+1

        b+1

        d+1

        d+1

        c

        c

        d+c

        d+c

        d+c

        d+c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        1

        1

        2

        2

        3

        3

        4

        4

        5

        5

        Q5T

        Q1

        Q1T

        Q2

        Q2T

        Q3

        Q3T

        Q5

        Q4

        Q4T

        b+1

        b+1

        d+1

        d+1

        c

        c

        d+c

        d+c

        d+c

        d+c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        1

        1

        2

        2

        3

        3

        4

        4

        5

        5

        6

        6

        Q5T

        Q1

        Q1T

        Q2

        Q2T

        Q3

        Q3T

        Q5

        Q4

        Q4T

        b+1

        b+1

        d+1

        d+1

        c

        c

        d+c

        d+c

        d+c

        d+c

        b = bandwidthc = columnsd = diagonalsConstraint c+d b

        Successive Band Reduction (BischofLangSun)

        Conventional vs CA - SBR

        Conventional Communication-Avoiding

        Touch all data 4 times Touch all data once

        >
        >

        Speedups of Sym Band Reductionvs DSBTRD

        bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

        bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

        bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

        bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

        bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

        Nonsymmetric Eigenproblem

        bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

        ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

        ndash QTAQ will be block upper triangularndash Apply recursively to A11 A22

        ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

        A11 A12

        ε A22

        Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

        Two Levels Memory Hierarchy

        Words Messages Words Messages

        BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

        Cholesky[Grsquo97][APrsquo00]

        [LAPACK][BDHSrsquo09]

        [Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

        Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

        LU[Grsquo97][Trsquo97]

        [GDXrsquo11][BDLSTrsquo13]

        [GDXrsquo11][BDLSTrsquo13]

        [Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

        QR[EGrsquo98][FWrsquo03]

        [DGHLrsquo12][BDLSTrsquo13]

        [FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

        [EGrsquo98][FWrsquo03][BDLSTrsquo13]

        [FWrsquo03][BDLSTrsquo13]

        Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

        Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

        Non Sym Eig [BDDrsquo11] [BDDrsquo11]

        Legend[Existing][Ours][Math-Lib][Random]

        Words (BW) Messages (L) Saving factor

        BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

        Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

        Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

        LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

        QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

        Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

        Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

        Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

        Attaining with extra memory 25D M=(cn2P)

        Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation: conventional, O(k) moves of data from slow to fast memory; new, O(1) moves of data – optimal (via the matrix powers kernel; see the 1D sketch after this slide)
  – Parallel implementation on p processors: conventional, O(k log p) messages (k SpMV calls, dot products); new, O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

        75
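A minimal sketch of where the O(1) serial data movement comes from, assuming the simplest possible setting: A is the 1D Poisson stencil tridiag(−1, 2, −1), one "processor" owns rows [lo, hi) away from the domain boundary, and it fetches a depth-k ghost zone once instead of communicating before each of the k SpMVs. All names are illustrative; real matrix powers kernels handle general sparse matrices and overlapping partitions.

```python
import numpy as np

def local_matrix_powers(x_ghost, k):
    """x_ghost holds x on indices [lo-k, hi+k): the local block plus a
    ghost zone of depth k, fetched from neighbors ONCE.  Returns
    [A@x, A@(A@x), ..., A^k @ x] restricted to the local rows [lo, hi),
    for the interior rows of the stencil A = tridiag(-1, 2, -1)."""
    out, v = [], x_ghost
    for m in range(1, k + 1):
        v = 2.0 * v[1:-1] - v[:-2] - v[2:]  # one stencil sweep; valid region shrinks by 1 per side
        t = k - m                           # unused ghost depth still surrounding the block
        out.append(v[t:len(v) - t])
    return out

# Check against k explicit SpMVs on the full problem.
n, k, lo, hi = 32, 3, 8, 16
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.random.default_rng(1).standard_normal(n)
local = local_matrix_powers(x[lo - k:hi + k], k)
full = x.copy()
for m in range(k):
    full = A @ full
    assert np.allclose(local[m], full[lo:hi])
print("k SpMV results computed from a single ghost exchange")
```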

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (a Block CSR sketch follows below)

        78
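As a sketch of what exploiting such dense substructure looks like, assuming SciPy's Block CSR (BSR) format as a stand-in for a hand-tuned register-blocked kernel: one column index is stored per 8x8 block instead of per scalar nonzero, and the inner loop becomes a small dense mat-vec.

```python
import numpy as np
from scipy.sparse import bsr_matrix

rng = np.random.default_rng(0)
nb, b = 64, 8                        # 64 block rows of 8x8 blocks -> n = 512
indptr, indices = [0], []
for i in range(nb):                  # ~3 random 8x8 blocks per block row
    indices.extend(np.sort(rng.choice(nb, size=3, replace=False)))
    indptr.append(len(indices))
data = rng.standard_normal((len(indices), b, b))  # the dense blocks themselves
A = bsr_matrix((data, indices, indptr), shape=(nb * b, nb * b))

x = rng.standard_normal(nb * b)
y = A @ x                            # SpMV; inner loop is an 8x8 dense mat-vec

# CSR stores one column index per scalar nonzero; BSR one per block (64x fewer).
print("stored values:", A.nnz, " block column indices:", len(indices))
```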

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heat maps (Mflops) comparing the reference implementation against the best blocking found by search, 4x2]

        79

Register Profile: Itanium 2

[Figure: performance across block sizes ranges from 190 Mflops (worst) to 1190 Mflops (best)]

        80

Register Profiles: IBM and Intel IA-64

[Four register-profile plots; best performance as a fraction of peak: Power3 – 17%, Power4 – 16%, Itanium 2 – 33%, Itanium 1 – 8%. Performance ranges: Power3 122–252 Mflops; Power4 459–820 Mflops; Itanium 1 107–247 Mflops; Itanium 2 190 Mflops – 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

        82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

        83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

        84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup (actual Mflop rate is 1.5² = 2.25x higher, since the kernel also does 1.5x the flops; a fill-ratio sketch follows below)

        85
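The fill ratio for a candidate r×c blocking can be measured directly. A minimal sketch using SciPy (assumptions: the matrix dimensions divide evenly by the block size; the selection heuristic in the final comment is the OSKI-style estimate, quoted from memory):

```python
from scipy.sparse import random as sprandom

def fill_ratio(A_csr, r, c):
    """Stored entries after r x c blocking (explicit zeros included),
    divided by the true nnz."""
    B = A_csr.tobsr(blocksize=(r, c))
    return B.nnz / A_csr.nnz         # BSR's nnz counts the filled-in zeros

A = sprandom(3000, 3000, density=0.002, format='csr', random_state=0)
for rc in [(1, 1), (2, 2), (3, 3), (4, 4)]:
    print(rc, round(fill_ratio(A, *rc), 2))
# Tuning heuristic: pick (r, c) maximizing estimated_mflops(r, c) / fill_ratio(r, c),
# where estimated_mflops comes from an off-line machine profile.
```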

Source: accelerator cavity design problem (Ko, via Husbands)

        86

        100x100 Submatrix Along Diagonal

87

        Post-RCM Reordering

        88

        Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

        90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

        91

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

        93

Example: Classical Conjugate Gradient (CG)

[Pseudocode figure omitted; annotation: SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Pseudocode figure omitted; annotations: the s-step basis is computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and local computations within the inner loop require no communication. A minimal serial CG sketch follows below.]
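For reference, here is a minimal serial CG in Python with comments marking the three communication events per iteration (one SpMV plus two global reductions) that CA-CG amortizes over s steps. This is the textbook method, not the CA-CG code from the slide:

```python
import numpy as np

def cg(A_mult, b, x0, tol=1e-10, maxit=500):
    x = x0.copy()
    r = b - A_mult(x)            # SpMV: neighbor communication in parallel
    p = r.copy()
    rho = r @ r                  # dot product: global reduction
    for _ in range(maxit):
        w = A_mult(p)            # SpMV: neighbor communication
        alpha = rho / (p @ w)    # dot product: global reduction
        x += alpha * p
        r -= alpha * w
        rho_new = r @ r          # dot product: global reduction
        if np.sqrt(rho_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x

n = 100                          # 1D Poisson test problem
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(lambda v: A @ v, b, np.zeros(n))
print(np.linalg.norm(A @ x - b))
```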

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

        96

[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, measured against machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]

        97

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                            Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)): CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)): Graph Laplacian             Stencils
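A small sketch of why this representation matters: with A = S + UDVᵀ, A can be applied to a vector in O(nnz(S) + nk) work without ever forming the dense n×n sum. All names below are illustrative:

```python
import numpy as np
from scipy.sparse import diags

n, k = 10000, 5
S = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')  # sparse part
rng = np.random.default_rng(0)
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))                                   # small & square

def apply_A(x):
    # (S + U D V^T) x = S x + U (D (V^T x)): O(nnz(S) + nk) work, not O(n^2)
    return S @ x + U @ (D @ (V.T @ x))

y = apply_A(rng.standard_normal(n))
print(y[:3])
# Repeated application gives A^k x; keeping A^k itself in "sparse + low rank"
# form is the harder representation question raised above.
```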

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

        101

Reproducible Floating Point Computation

• Goal: get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two histograms. Left: absolute error for random vectors – same magnitude, opposite signs. Right: relative error for orthogonal vectors – the sign of the result is not reproducible]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum − minimum
• Relative error = absolute error / maximum absolute value

        103

Goals/Approaches for Reproducibility

• Consider summation or dot product (a small demo of the nonassociativity follows below)
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

        104
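A quick demonstration of the underlying nonassociativity (this shows only the symptom; the prerounding technique itself is not shown): summing the same data in different orders, as different thread counts or layouts would, typically changes the trailing bits of the result.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6) * 10.0 ** rng.integers(-8, 8, 10**6)

s1 = np.sum(x)                                     # one summation order
s2 = np.sum(x[::-1])                               # reversed order
s3 = sum(np.sum(c) for c in np.array_split(x, 4))  # mimic a 4-thread reduction

print(s1, s2, s3)
print("absolute error:", max(s1, s2, s3) - min(s1, s2, s3))
# All three are "correct" to rounding, but generally not bit-wise identical;
# a fixed reduction tree or exact/prerounded accumulation restores reproducibility.
```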

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

        Summary

Don't Communic…

        106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          Lower bound for all ldquon3-likerdquo linear algebra

          bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

          matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

          6

          bull Let M = ldquofastrdquo memory size (per processor)

          words_moved (per processor) = (flops (per processor) M12 )

          messages_sent (per processor) = (flops (per processor) M32 )

          bull Parallel case assume either load or memory balanced

          Lower bound for all ldquon3-likerdquo linear algebra

          bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

          matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

          7

          bull Let M = ldquofastrdquo memory size (per processor)

          words_moved (per processor) = (flops (per processor) M12 )

          messages_sent ge words_moved largest_message_size

          bull Parallel case assume either load or memory balanced

          Lower bound for all ldquon3-likerdquo linear algebra

          bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

          matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

          8

          bull Let M = ldquofastrdquo memory size (per processor)

          words_moved (per processor) = (flops (per processor) M12 )

          messages_sent (per processor) = (flops (per processor) M32 )

          bull Parallel case assume either load or memory balanced

          SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

          Limits to parallel scaling (12)bull Consider dense case flops_per_proc = n3P

          ndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

          bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

          bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

          bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

          bull How big can we make P and M

          Limits to parallel scaling (22)

          bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

          bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

          ndash Otherwise no communication may be neededbull Thm Words= (n2P23 ) independent of M

          bull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

          Can we attain these lower bounds

          bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

          bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

          new ways to encode answers new data structures

          ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

          ndash Algorithms Energy Heterogeneous Processors hellip11

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          25D Matrix Multiplication

          bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

          c

          (Pc)12

          (Pc)12

          Example P = 32 c = 2

          25D Matrix Multiplication

          bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

          k

          j

          iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

          (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

          (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

          25D Matmul on BGP 16K nodes 64K coresc = 16 copies

          Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

          12x faster

          27x faster

          Perfect Strong Scaling ndash in Time and Energy (12)

          bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

          bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

          ndash γT βT αT = secs per flop per word_moved per message of size m

          bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

          ndash γE βE αE = joules for same operations

          ndash δE = joules per word of memory used per sec

          ndash εE = joules per sec for leakage etc

          bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

          Perfect Strong Scaling ndash in Time and Energy (22)

          bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

          bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

          achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

          Handling Heterogeneitybull Suppose each of P processors could differ

          ndash γi = secflop βi = secword αi = secmessage Mi = memory

          bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

          12 + Fi αi Mi32 = Fi [γi + βi Mi

          12 + αi Mi32] = Fi ξi

          ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

          ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

          bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

          bull Works for Strassen other algorithmshellip

          Application to Tensor Contractions

          bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

          bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

          bull Heavily used in electronic structure calculationsndash Ex NWChem

          bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

          ndash Solomonik Hammond Matthews

          C(ijk) = Σm A(ijm)B(mk)

          A3-fold symm

          B2-fold symm

          C2-fold symm

          Application to Tensor Contractions

          bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

          bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

          bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

          bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

          Communication Lower Bounds for Strassen-like matmul algorithms

          bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

          bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

          ndash words_moved = Ω (flopsM^(logmpq -1))

          bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

          Classical O(n3) matmul

          words_moved =Ω (M(nM12)3P)

          Strassenrsquos O(nlg7) matmul

          words_moved =Ω (M(nM12)lg7P)

          Strassen-like O(nω) matmul

          words_moved =Ω (M(nM12)ωP)

          vs

          Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

          Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

          CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

          Communication Avoiding Parallel Strassen (CAPS)

          Best way to interleaveBFS and DFS is an tuning parameter

          26

          Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

          Speedups 24-184(over previous Strassen-based algorithms)

          Invited to appear as Research Highlight in CACM

          Strassen-like beyond matmul

          bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

          bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

          Ballard D Holtz Schwartz

          Cache and Network Oblivious Algorithms

          bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

          bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

          bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

          dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

          CARMA Performance Distributed Memory

          Square m = k = n = 6144

          ScaLAPACK

          CARMA

          Peak

          (log)

          (log)

          Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

          CARMA Performance Distributed Memory

          Inner Product m = n = 192 k = 6291456

          ScaLAPACK

          CARMAPeak

          (log)

          (log)

          Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

          CARMA Performance Shared Memory

          Square m = k = n

          MKL (double)CARMA (double)

          MKL (single)CARMA (single)

          Peak (single)

          Peak (double)

          (log)

          (linear)

          Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

          CARMA Performance Shared Memory

          Inner Product m = n = 64

          MKL (double)

          CARMA (double)

          MKL (single)

          CARMA (single)

          (log)

          (linear)

          Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

          Why is CARMA Faster in Shared MemoryL3 Cache Misses

          Shared Memory Inner Product (m = n = 64 k = 524288)

          97 Fewer Misses

          86 Fewer Misses

          (linear)

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

          35

          bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

          bull Recursive Approach func factor(A) if A has 1 column update it

          else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

          bull None of these approaches minimizes messagesbull Parallel case Partial

          Pivoting =gt n reductionsbull Need another idea

          TSQR An Architecture-Dependent Algorithm

          W =

          W0

          W1

          W2

          W3

          R00

          R10

          R20

          R30

          R01

          R11

          R02Parallel

          W =

          W0

          W1

          W2

          W3

          R01 R02

          R00

          R03

          SequentialStreaming

          W =

          W0

          W1

          W2

          W3

          R00

          R01R01

          R11

          R02

          R11

          R03

          Dual Core

          Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

          Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

          Wnxb =

          W1

          W2

          W3

          W4

          P1middotL1middotU1

          P2middotL2middotU2

          P3middotL3middotU3

          P4middotL4middotU4

          =

          Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

          W1rsquoW2rsquoW3rsquoW4rsquo

          P12middotL12middotU12

          P34middotL34middotU34

          =Choose b pivot rows call them W12rsquo

          Choose b pivot rows call them W34rsquo

          W12rsquoW34rsquo

          = P1234middotL1234middotU1234

          Choose b pivot rows

          Go back to W and use these b pivot rows (move them to top do LU without pivoting)

          37

          Minimizing Communication in TSLU

          W = W1

          W2

          W3

          W4

          LULULULU

          LU

          LULUParallel

          W = W1

          W2

          W3

          W4

          LULU

          LU

          LUSequentialStreaming

          W = W1

          W2

          W3

          W4

          LULU LU

          LULU

          LULU

          Dual Core

          Can choose reduction tree dynamically to match architecture as before

          38

          Making TSLU Numerically Stable

          bull Details matterndash Going up the tree we could do LU either on original rows of A

          (tournament pivoting) or computed rows of Undash Only tournament pivoting stable

          bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

          bull Why just a ldquoThmrdquo

          39

          Stability of LU using TSLU CALU

          Summer School Lecture 4 40

          bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

          Why is stability of TSLU just a ldquoThmrdquo

          bull Proof is correct ndash in exact arithmeticbull Experiment

          ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

          they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

          ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

          ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

          ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

          bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

          panel in symmetric-indefinite factorization 41

          Fixing TSLU

          bull Run TSLU quickly test for stability fix if necessary (rare)

          bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

          bull Last topic in lecture how to guarantee floating point reproducibility

          42

          2D CALU with Tournament Pivoting

          43

          25D CALU with Tournament Pivoting (c=4 copies)

          44

          Exascale Machine ParametersSource DOE Exascale Workshop

          bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

          Exascale predicted speedupsfor Gaussian Elimination

          2D CA-LU vs ScaLAPACK-LU

          log2 (p)

          log 2

          (n2 p

          ) =

          log 2

          (mem

          ory_

          per_

          proc

          )

          Up to 29x

          25D vs 2D LUWith and Without Pivoting

          Other CA algorithms for Ax=b least squares(13)

          bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

          ldquosimplerdquobull Save frac12 flops preserve inertia

          ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

          ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

          ndash PAPT = LTLT where T is banded using TSLU

          48

          0 0

          0

          0 0

          0

          0

          hellip

          hellip

          ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

          Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

          ndash So far could not do partial pivoting and minimize messages just words

          ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

          ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

          49

          bull func factor(A) if A has 1 column update it else factor(left half of A)

          update right half of A

          factor(right half of A)

          bull Words = O(n3M12)

          bull Messages = O(n3M)

          bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

          bull Words = O(n3M12)

          bull Messages = O(n3M32)

          Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

          ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

          ndash Usual approach like Partial Pivotingbull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

          ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

          groups of b columns either using usual approach or something better (GuEisenstat)

          bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

          ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          What about sparse matrices (13)

          bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

          bull But canrsquot reorder outer loop for 25D need another ideabull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = AB

          ndash Dependencies ok 25D works just different semiringbull Kleenersquos Algorithm

          52

          for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

          D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

          Performance of 25D APSP using Kleene

          53

          Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

          62xspeedup

          2x speedup

          What about sparse matrices (23)

          bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of G(A)

          have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

          bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering for

          2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

          bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

          multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

          separators)

          54

          What about sparse matrices (33)

          bull If matrix stays very sparse lower bound unattainable new one bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

          data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

          bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

          Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

          bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

          along dimensions most likely to minimize cost55

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          Symmetric Eigenproblem and SVD

          bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

          bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

          ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

          b+1

          b+1

          Successive Band Reduction (BischofLangSun)

          1

          b+1

          b+1

          d+1

          c

          Successive Band Reduction (BischofLangSun)

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          1Q1

          b+1

          b+1

          d+1

          c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          12

          Q1

          b+1

          b+1

          d+1

          d+c

          d+c

          c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          1

          12

          Q1

          Q1T

          b+1

          b+1

          d+1

          d+1

          cd+c

          d+c

          c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          1

          1

          2

          2Q1

          Q1T

          b+1

          b+1

          d+1

          d+1

          cd+c

          d+c

          d+c

          d+c

          c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          1

          1

          2

          2

          3

          3

          Q1

          Q1T

          Q2

          Q2T

          b+1

          b+1

          d+1

          d+1

          d+c

          d+c

          d+c

          d+c

          c

          c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          1

          1

          2

          2

          3

          3

          4

          4

          Q1

          Q1T

          Q2

          Q2T

          Q3

          Q3T

          b+1

          b+1

          d+1

          d+1

          d+c

          d+c

          d+c

          d+c

          c

          c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          1

          1

          2

          2

          3

          3

          4

          4

          5

          5

          Q1

          Q1T

          Q2

          Q2T

          Q3

          Q3T

          Q4

          Q4T

          b+1

          b+1

          d+1

          d+1

          c

          c

          d+c

          d+c

          d+c

          d+c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          1

          1

          2

          2

          3

          3

          4

          4

          5

          5

          Q5T

          Q1

          Q1T

          Q2

          Q2T

          Q3

          Q3T

          Q5

          Q4

          Q4T

          b+1

          b+1

          d+1

          d+1

          c

          c

          d+c

          d+c

          d+c

          d+c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          1

          1

          2

          2

          3

          3

          4

          4

          5

          5

          6

          6

          Q5T

          Q1

          Q1T

          Q2

          Q2T

          Q3

          Q3T

          Q5

          Q4

          Q4T

          b+1

          b+1

          d+1

          d+1

          c

          c

          d+c

          d+c

          d+c

          d+c

          b = bandwidthc = columnsd = diagonalsConstraint c+d b

          Successive Band Reduction (BischofLangSun)

          Conventional vs CA - SBR

          Conventional Communication-Avoiding

          Touch all data 4 times Touch all data once

          >
          >

          Speedups of Sym Band Reductionvs DSBTRD

          bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

          bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

          bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

          bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

          bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

          Nonsymmetric Eigenproblem

          bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

          ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

          ndash QTAQ will be block upper triangularndash Apply recursively to A11 A22

          ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

          A11 A12

          ε A22

          Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

          Two Levels Memory Hierarchy

          Words Messages Words Messages

          BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

          Cholesky[Grsquo97][APrsquo00]

          [LAPACK][BDHSrsquo09]

          [Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

          Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

          LU[Grsquo97][Trsquo97]

          [GDXrsquo11][BDLSTrsquo13]

          [GDXrsquo11][BDLSTrsquo13]

          [Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

          QR[EGrsquo98][FWrsquo03]

          [DGHLrsquo12][BDLSTrsquo13]

          [FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

          [EGrsquo98][FWrsquo03][BDLSTrsquo13]

          [FWrsquo03][BDLSTrsquo13]

          Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

          Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

          Non Sym Eig [BDDrsquo11] [BDDrsquo11]

          Legend[Existing][Ours][Math-Lib][Random]

          Words (BW) Messages (L) Saving factor

          BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

          Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

          Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

          LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

          QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

          Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

          Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

          Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

          Attaining with extra memory 25D M=(cn2P)

          Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          Avoiding Communication in Iterative Linear Algebra

          bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

          bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

          ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

          bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

          ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

          bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

          75

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          Example The Difficulty of Tuning SpMV

          bull n = 21200bull nnz = 15 M

          bull Source NASA structural analysis problem (raefsky)

          77

          Example The Difficulty of Tuning

          bull n = 21200bull nnz = 15 M

          bull Source NASA structural analysis problem (raefsky)

          bull 8x8 dense substructure exploit this to limit mem_refs

          78

          Speedups on Itanium 2 The Need for Search

          Reference

          Best 4x2

          Mflops

          Mflops

          79

          Register Profile Itanium 2

          190 Mflops

          1190 Mflops

          80

          Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

          Itanium 2 - 33Itanium 1 - 8

          252 Mflops

          122 Mflops

          820 Mflops

          459 Mflops

          247 Mflops

          107 Mflops

          12 Gflops

          190 Mflops

          Another example of tuning challenges for SpMV

          bull Ex11 matrix (fluid flow)

          bull More complicated non-zero structure in general

          bull N = 16614bull NNZ = 11M

          82

          Zoom in to top corner

          bull More complicated non-zero structure in general

          bull N = 16614bull NNZ = 11M

          83

          3x3 blocks look natural buthellip

          bull Example 3x3 blockingndash Logical grid of 3x3 cells

          bull But would lead to lots of ldquofill-inrdquo

          84

          Extra Work Can Improve Efficiency

          bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

          bull On Pentium III 15x speedupndash Actual mflop rate 152 = 225 higher

          85

          Source Accelerator Cavity Design Problem (Ko via Husbands)

          86

          100x100 Submatrix Along Diagonal

          Summer School Lecture 787

          Post-RCM Reordering

          88

          Effect of Combined RCM+TSP Reordering

          Before Green + RedAfter Green + Blue

          Summer School Lecture 789

          2x speedups on Pentium 4 Power 4 hellip

          Summary of Other Performance Optimizations

          bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

          bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

          bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

          90

          Optimized Sparse Kernel Interface - OSKI

          bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

          bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

          bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

          software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

          91

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          93

          Example Classical Conjugate Gradient (CG)

          SpMVs and dot products require communication in

          each iteration

          via CA Matrix Powers Kernel

          Global reduction to compute G

          94

          Example CA-Conjugate Gradient

          Local computations within inner loop require

          no communication

          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

          bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

          96

[Plot: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG (monomial) shows slower convergence and loss of accuracy, relative to machine precision, due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
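The breakdown is easy to reproduce: the monomial Krylov basis [p, Ap, A^2 p, …] becomes ill-conditioned exponentially fast in s, which is why practical CA-CG switches to Newton or Chebyshev bases. A small sketch on the same model problem:

    import numpy as np

    # model problem from the plot: 2D Poisson, 5-point stencil, 30x30 grid
    m = 30
    T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
    rng = np.random.default_rng(0)
    v = rng.standard_normal(m * m)

    basis = [v]
    for s in range(1, 17):
        v = A @ v                     # next monomial basis vector A^s p
        basis.append(v)
        K = np.column_stack(basis)
        print(f"s = {s:2d}   cond([p, Ap, ..., A^s p]) = {np.linalg.cond(K):.2e}")
    # cond(K) grows exponentially in s; once it approaches 1/macheps
    # (~1e16) the basis is numerically rank deficient and CA-CG with the
    # monomial basis breaks down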

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

Examples, by how the nonzero entries and the indices are represented:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
 Entries explicit (O(nnz)):   CSR and variations           Vision, climate, AMR, …
 Entries implicit (o(nnz)):   Graph Laplacians             Stencils
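As a concrete instance of the S + UDV^T structure: such a matrix (or its powers) is never formed densely; S is applied sparsely and the low-rank part through its factors. A small sketch, with scipy's CSR standing in for any explicit sparse format:

    import numpy as np
    import scipy.sparse as sp

    def apply_sparse_plus_low_rank(S, U, D, V, x):
        """y = (S + U D V^T) x without forming the dense matrix.
        S: sparse (n x n);  U, V: n x r;  D: r x r, with r << n."""
        return S @ x + U @ (D @ (V.T @ x))

    def apply_power(S, U, D, V, x, k):
        # A^k x by repeated application; each step costs
        # O(nnz(S) + n*r) instead of O(n^2)
        for _ in range(k):
            x = apply_sparse_plus_low_rank(S, U, D, V, x)
        return x

    n, r = 1000, 5
    S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
    rng = np.random.default_rng(1)
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    y = apply_power(S, U, D, V, rng.standard_normal(n), k=3)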

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm at GNS-MBH
 – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
 – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
 – Most: "What? How will I debug without reproducibility?"
 – A few: "I know better, and do careful error analysis"
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (errors of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
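The root cause is that floating-point addition is not associative, so a library that changes its reduction order with the thread count changes the rounding. A toy sketch of the effect, where different static blockings of one array stand in for different thread counts (blocked_sum is an illustrative helper, not an MKL call):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    def blocked_sum(x, nblocks):
        """Sum x as nblocks partial sums, mimicking a static partition
        across nblocks threads followed by a final reduction."""
        parts = np.array_split(x, nblocks)
        return sum(float(np.sum(p)) for p in parts)

    sums = [blocked_sum(x, t) for t in (1, 2, 3, 4)]
    print("results:", sums)
    print("absolute error (max - min):", max(sums) - min(sums))
    # typically nonzero: each blocking rounds differently, even though
    # every result is "correct" to within the usual error bound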

Goals / Approaches for Reproducibility

• Consider summation or dot products
• Goals:
 1. Same answer, independent of layout, number of processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee a fixed reduction tree (fails goal 2 or 3)
 – Use (very) high precision to get the exact answer (fails goal 2)
 – Prerounding technique (Nguyen, D.)
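To make the prerounding approach concrete, here is a toy simplification assumed for illustration (not Nguyen and Demmel's actual algorithm, which uses several bins and a single fused reduction): quantize every summand to a common power-of-two granularity chosen from max|x_i| and n, so that every subsequent addition is exact and the result is independent of summation order. The user's accuracy choice corresponds to the granularity.

    import math, random

    def prerounded_sum(x):
        """Order-independent sum via prerounding (toy version).
        All pre-rounded summands are integer multiples of a power-of-two
        delta small enough that every partial sum is exact in double."""
        n = len(x)
        m = max(abs(v) for v in x)
        if m == 0.0:
            return 0.0
        e = math.ceil(math.log2(m))                   # 2**e >= max |x_i|
        delta = 2.0 ** (e - 52 + math.ceil(math.log2(n)) + 1)
        q = [round(v / delta) * delta for v in x]     # exact: delta is 2^k
        return sum(q)                                 # every addition exact

    random.seed(0)
    x = [random.gauss(0, 1) for _ in range(10**5)]
    s1 = prerounded_sum(x)
    random.shuffle(x)
    s2 = prerounded_sum(x)
    assert s1 == s2     # bit-wise identical despite reordering

Accuracy is lost only in the prerounding step (about log2(n) + 1 bits relative to the largest summand in this toy version); everything after it is exact, hence associative.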

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

• Implementing Communication-Avoiding Algorithms
• Why avoid communication?
• Goals
• Outline
• Outline (2)
• Lower bound for all "n3-like" linear algebra
• Lower bound for all "n3-like" linear algebra (2)
• Lower bound for all "n3-like" linear algebra (3)
• Limits to parallel scaling (1/2)
• Limits to parallel scaling (2/2)
• Can we attain these lower bounds?
• Outline (3)
• 2.5D Matrix Multiplication
• 2.5D Matrix Multiplication (2)
• 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
• Perfect Strong Scaling – in Time and Energy (1/2)
• Perfect Strong Scaling – in Time and Energy (2/2)
• Handling Heterogeneity
• Application to Tensor Contractions
• C(i,j,k) = Σm A(i,j,m)·B(m,k)
• Application to Tensor Contractions (2)
• Communication Lower Bounds for Strassen-like matmul algorithms
• vs.
• Slide 26
• Strassen-like beyond matmul
• Cache and Network Oblivious Algorithms
• CARMA Performance: Distributed Memory
• CARMA Performance: Distributed Memory (2)
• CARMA Performance: Shared Memory
• CARMA Performance: Shared Memory (2)
• Why is CARMA Faster in Shared Memory?
• Outline (4)
• One-sided Factorizations (LU, QR), so far
• TSQR: An Architecture-Dependent Algorithm
• Back to LU: Using similar idea for TSLU as TSQR – Use reduction
• Minimizing Communication in TSLU
• Making TSLU Numerically Stable
• Stability of LU using TSLU: CALU
• Why is stability of TSLU just a "Thm"?
• Fixing TSLU
• 2D CALU with Tournament Pivoting
• 2.5D CALU with Tournament Pivoting (c=4 copies)
• Exascale Machine Parameters (Source: DOE Exascale Workshop)
• Exascale predicted speedups for Gaussian Elimination: 2D CA…
• 2.5D vs 2D LU: With and Without Pivoting
• Other CA algorithms for Ax=b, least squares (1/3)
• Other CA algorithms for Ax=b, least squares (2/3)
• Other CA algorithms for Ax=b, least squares (3/3)
• Outline (5)
• What about sparse matrices? (1/3)
• Performance of 2.5D APSP using Kleene
• What about sparse matrices? (2/3)
• What about sparse matrices? (3/3)
• Outline (6)
• Symmetric Eigenproblem and SVD
• Slide 58
• Slide 59
• Slide 60
• Slide 61
• Slide 62
• Slide 63
• Slide 64
• Slide 65
• Slide 66
• Slide 67
• Slide 68
• Conventional vs. CA-SBR
• Speedups of Sym. Band Reduction vs. DSBTRD
• Nonsymmetric Eigenproblem
• Attaining the Lower bounds: Sequential
• Attaining the Lower bounds: Parallel 2D, M = Θ(n^2/P) (Ignoring po…
• Outline (7)
• Avoiding Communication in Iterative Linear Algebra
• Outline (8)
• Example: The Difficulty of Tuning SpMV
• Example: The Difficulty of Tuning
• Speedups on Itanium 2: The Need for Search
• Register Profile: Itanium 2
• Register Profiles: IBM and Intel IA-64
• Another example of tuning challenges for SpMV
• Zoom in to top corner
• 3x3 blocks look natural, but…
• Extra Work Can Improve Efficiency
• Slide 86
• Slide 87
• Slide 88
• Slide 89
• Summary of Other Performance Optimizations
• Optimized Sparse Kernel Interface – OSKI
• Outline (9)
• Example: Classical Conjugate Gradient (CG)
• Example: CA-Conjugate Gradient
• Outline (10)
• Slide 96
• Slide 97
• Outline (11)
• What is a "sparse matrix"?
• Outline (12)
• Reproducible Floating Point Computation
• Intel MKL non-reproducibility
• Goals/Approaches for Reproducibility
• Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdow…
• Collaborators and Supporters
• Summary


            Extra Work Can Improve Efficiency

            bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

            bull On Pentium III 15x speedupndash Actual mflop rate 152 = 225 higher

            85

            Source Accelerator Cavity Design Problem (Ko via Husbands)

            86

            100x100 Submatrix Along Diagonal

            Summer School Lecture 787

            Post-RCM Reordering

            88

            Effect of Combined RCM+TSP Reordering

            Before Green + RedAfter Green + Blue

            Summer School Lecture 789

            2x speedups on Pentium 4 Power 4 hellip

            Summary of Other Performance Optimizations

            bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

            bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

            bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

            90

            Optimized Sparse Kernel Interface - OSKI

            bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

            bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

            bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

            software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

            91

            Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

            ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

            ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

            bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

            bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

            93

            Example Classical Conjugate Gradient (CG)

            SpMVs and dot products require communication in

            each iteration

            via CA Matrix Powers Kernel

            Global reduction to compute G

            94

            Example CA-Conjugate Gradient

            Local computations within inner loop require

            no communication

            Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

            ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

            ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

            bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

            bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

            96

            Slower convergence due

            to roundoff

            Loss of accuracy due to roundoff

            At s = 16 monomial basis is rank deficient Method breaks down

            Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

            CA-CG (monomial)CG

            machine precision

            97

            Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

            ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

            ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

            bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

            bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

            What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

            bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

            bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

            matrices

            Explicit (O(nnz)) Implicit (o(nnz))

            Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

            Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

            Indices

            Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

            ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

            ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

            bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

            bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

            101

            bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

            ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

            demanded by customers (construction engineers) otherwise they donrsquot believe results

            ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

            ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

            Reproducible Floating Point Computation

            Absolute Error for Random Vectors

            Same magnitude opposite signs

            Intel MKL non-reproducibility

            Relative Error for Orthogonal vectors

            Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

            Sign notreproducible

            103

            bull Consider summation or dot productbull Goals

            1 Same answer independent of layout processors order of summands

            2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

            bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

            GoalsApproaches for Reproducibility

            104

            Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

            Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

            Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

            bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

            Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

            bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

            Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

            Instruments NEC Nokia NVIDIA Samsung Oracle

            bull bebopcsberkeleyedu

            Summary

            Donrsquot Communichellip

            106

            Time to redesign all linear algebra n-body hellip algorithms and software

            (and compilers)

• Implementing Communication-Avoiding Algorithms
• Why avoid communication?
• Goals
• Outline
• Outline (2)
• Lower bound for all "n³-like" linear algebra
• Lower bound for all "n³-like" linear algebra (2)
• Lower bound for all "n³-like" linear algebra (3)
• Limits to parallel scaling (1/2)
• Limits to parallel scaling (2/2)
• Can we attain these lower bounds?
• Outline (3)
• 2.5D Matrix Multiplication
• 2.5D Matrix Multiplication (2)
• 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
• Perfect Strong Scaling – in Time and Energy (1/2)
• Perfect Strong Scaling – in Time and Energy (2/2)
• Handling Heterogeneity
• Application to Tensor Contractions
• C(i,j,k) = Σ_m A(i,j,m)·B(m,k)
• Application to Tensor Contractions (2)
• Communication Lower Bounds for Strassen-like matmul algorithms
• vs.
• Slide 26
• Strassen-like beyond matmul
• Cache and Network Oblivious Algorithms
• CARMA Performance: Distributed Memory
• CARMA Performance: Distributed Memory (2)
• CARMA Performance: Shared Memory
• CARMA Performance: Shared Memory (2)
• Why is CARMA Faster in Shared Memory?
• Outline (4)
• One-sided Factorizations (LU, QR), so far
• TSQR: An Architecture-Dependent Algorithm
• Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree
• Minimizing Communication in TSLU
• Making TSLU Numerically Stable
• Stability of LU using TSLU: CALU
• Why is stability of TSLU just a "Thm"?
• Fixing TSLU
• 2D CALU with Tournament Pivoting
• 2.5D CALU with Tournament Pivoting (c=4 copies)
• Exascale Machine Parameters (Source: DOE Exascale Workshop)
• Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
• 2.5D vs 2D LU: With and Without Pivoting
• Other CA algorithms for Ax=b, least squares (1/3)
• Other CA algorithms for Ax=b, least squares (2/3)
• Other CA algorithms for Ax=b, least squares (3/3)
• Outline (5)
• What about sparse matrices? (1/3)
• Performance of 2.5D APSP using Kleene
• What about sparse matrices? (2/3)
• What about sparse matrices? (3/3)
• Outline (6)
• Symmetric Eigenproblem and SVD
• Slide 58
• Slide 59
• Slide 60
• Slide 61
• Slide 62
• Slide 63
• Slide 64
• Slide 65
• Slide 66
• Slide 67
• Slide 68
• Conventional vs CA - SBR
• Speedups of Sym. Band Reduction vs DSBTRD
• Nonsymmetric Eigenproblem
• Attaining the Lower bounds: Sequential
• Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P) (Ignoring poly-log(P) factors)
• Outline (7)
• Avoiding Communication in Iterative Linear Algebra
• Outline (8)
• Example: The Difficulty of Tuning SpMV
• Example: The Difficulty of Tuning
• Speedups on Itanium 2: The Need for Search
• Register Profile: Itanium 2
• Register Profiles: IBM and Intel IA-64
• Another example of tuning challenges for SpMV
• Zoom in to top corner
• 3x3 blocks look natural, but…
• Extra Work Can Improve Efficiency!
• Slide 86
• Slide 87
• Slide 88
• Slide 89
• Summary of Other Performance Optimizations
• Optimized Sparse Kernel Interface - OSKI
• Outline (9)
• Example: Classical Conjugate Gradient (CG)
• Example: CA-Conjugate Gradient
• Outline (10)
• Slide 96
• Slide 97
• Outline (11)
• What is a "sparse matrix"?
• Outline (12)
• Reproducible Floating Point Computation
• Intel MKL non-reproducibility
• Goals/Approaches for Reproducibility
• Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M
• Collaborators and Supporters
• Summary

Lower bound for all "n³-like" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω(#flops (per processor) / M^{1/2})

  #messages_sent ≥ #words_moved / largest_message_size

• Parallel case: assume either load or memory balanced

Lower bound for all "n³-like" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω(#flops (per processor) / M^{1/2})

  #messages_sent (per processor) = Ω(#flops (per processor) / M^{3/2})

• Parallel case: assume either load or memory balanced

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Limits to parallel scaling (1/2)

• Consider dense case, #flops_per_proc = n³/P
  – #Words = Ω(n³/(P·M^{1/2}))
  – #Messages = Ω(n³/(P·M^{3/2}))
• What is M? Must be at least n²/P to hold data
  – #Words = Ω(n²/P^{1/2})
  – #Messages = Ω(P^{1/2})
• But if M fixed, looks like perfect strong scaling in time
  – #Flops, #Words, #Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second, for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?

Limits to parallel scaling (2/2)

• Consider dense case, #flops_per_proc = n³/P
  – #Words = Ω(n³/(P·M^{1/2}))
  – #Messages = Ω(n³/(P·M^{3/2}))
• How big can we make P? and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise, no communication may be needed
• Thm: #Words = Ω(n²/P^{2/3}), independent of M
• Reached when M = n²/P^{2/3} too, or P = n³/M^{3/2}, and #Messages = Ω(1) (log P in practice)
• Attained by 2.5D algorithm when c = P^{1/3} ("3D alg")
• Can keep increasing P until P = n³, #Words = #Messages = Ω(1) (log n in practice)

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too!)
• Only a few sparse algorithms so far
• Lots of work in progress
  – Algorithms, Energy, Heterogeneous Processors, …

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

2.5D Matrix Multiplication

• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid

[Figure: processor grid of dimensions (P/c)^{1/2} x (P/c)^{1/2} x c; example P = 32, c = 2]

2.5D Matrix Multiplication

• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid, indexed by (i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^{1/2} x n(c/P)^{1/2}

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along k-axis, so P(i,j,0) owns C(i,j)
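To make the three steps concrete, here is a minimal sequential NumPy sketch of the 2.5D schedule (my illustration, not code from the talk): the outer k-loop plays the role of the c grid levels, each performing its 1/c share of the SUMMA steps, and the accumulation into C stands in for the final sum-reduction along the k-axis. It assumes c divides the grid dimension (P/c)^{1/2}, which divides n.

    import numpy as np

    def matmul_25d(A, B, P=8, c=2):
        n = A.shape[0]
        g = int(round((P // c) ** 0.5))   # grid is g x g x c
        b = n // g                        # block size per processor
        C = np.zeros((n, n))
        for k in range(c):                # level k does 1/c of the SUMMA steps
            for m in range(k * g // c, (k + 1) * g // c):
                for i in range(g):
                    for j in range(g):
                        C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                            A[i*b:(i+1)*b, m*b:(m+1)*b] @
                            B[m*b:(m+1)*b, j*b:(j+1)*b])
        return C                          # += over k = sum-reduce along k-axis

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8)); B = rng.standard_normal((8, 8))
    assert np.allclose(matmul_25d(A, B, P=8, c=2), A @ B)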

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Figure: strong-scaling plot; annotations on the slide read "12x faster" and "2.7x faster" vs the 2D algorithm]

Distinguished Paper Award, EuroPar'11 (Solomonik, D.)
SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) · [γ_T + β_T/M^{1/2} + α_T/(m·M^{1/2})] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec, for leakage, etc.
• E(cP) = cP · { n³/(cP) · [γ_E + β_E/M^{1/2} + α_E/(m·M^{1/2})] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n³/(cP) · [γ_T + β_T/M^{1/2} + α_T/(m·M^{1/2})] = T(P)/c
• E(cP) = cP · { n³/(cP) · [γ_E + β_E/M^{1/2} + α_E/(m·M^{1/2})] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as:
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
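The two formulas are easy to play with numerically. A small sketch follows (all parameter values are made-up placeholders, not measured hardware constants); it checks the perfect-strong-scaling claim that scaling P by c at fixed per-processor M divides time by c while leaving energy unchanged.

    def time_energy(n, P, M, m, gT, bT, aT, gE, bE, aE, dE, eE):
        flops = n**3 / P                  # flops per processor
        T = flops * (gT + bT / M**0.5 + aT / (m * M**0.5))
        E = P * (flops * (gE + bE / M**0.5 + aE / (m * M**0.5))
                 + (dE * M + eE) * T)
        return T, E

    args = dict(n=4096, M=2**18, m=1024, gT=1e-9, bT=1e-8, aT=1e-6,
                gE=1e-9, bE=1e-8, aE=1e-6, dE=1e-12, eE=1e-3)
    T1, E1 = time_energy(P=64, **args)
    T2, E2 = time_energy(P=128, **args)   # c = 2, same M per processor
    print(T1 / T2, E2 / E1)               # ≈ 2.0 and ≈ 1.0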

Handling Heterogeneity

• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^{1/2} + F_i·α_i/M_i^{3/2} = F_i·[γ_i + β_i/M_i^{1/2} + α_i/M_i^{3/2}] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n³, minimizing T = max_i T_i
  – Answer: F_i = n³·(1/ξ_i)/Σ_j(1/ξ_j) and T = n³/Σ_j(1/ξ_j)
• Optimal Algorithm for n x n matmul (see the sketch below)
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms …
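A few lines of Python make the optimal split concrete (my illustration; the processor parameters are invented): F_i is proportional to 1/ξ_i, so every processor finishes at the same time T.

    def optimal_split(n, gamma, beta, alpha, M):
        # xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2)
        xi = [g + b / m**0.5 + a / m**1.5
              for g, b, a, m in zip(gamma, beta, alpha, M)]
        total = sum(1.0 / x for x in xi)
        F = [n**3 / (x * total) for x in xi]   # work for processor i
        T = n**3 / total                       # common finish time: F_i * xi_i = T
        return F, T

    F, T = optimal_split(1024, gamma=[1e-9, 2e-9], beta=[1e-8, 1e-8],
                         alpha=[1e-6, 2e-6], M=[2**20, 2**18])
    print(sum(F) - 1024**3)   # ~0: all flops assigned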

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

C(i,j,k) = Σ_m A(i,j,m)·B(m,k)

[Figure: contraction with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
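For readers less used to index notation, the example contraction is one einsum call (my illustration; the dimensions 4, 5, 6 are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4, 5, 5))
    B = rng.standard_normal((5, 5, 6))
    C = np.einsum('ijmn,mnk->ijk', A, B)   # C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
    print(C.shape)                         # (4, 4, 6)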

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^{2/ω}
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – #words_moved = Ω(#flops / M^{log_{mp} q − 1})
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:
  #words_moved = Ω(M·(n/M^{1/2})³/P)

Strassen's O(n^{lg 7}) matmul:
  #words_moved = Ω(M·(n/M^{1/2})^{lg 7}/P)

Strassen-like O(n^ω) matmul:
  #words_moved = Ω(M·(n/M^{1/2})^ω/P)

vs.

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleave BFS and DFS is a tuning parameter

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

[Figure: strong-scaling plot; speedups of 24%-184% over previous Strassen-based algorithms]

Invited to appear as Research Highlight in CACM
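For orientation, here is the sequential kernel that CAPS parallelizes, Strassen's recursion, as a NumPy sketch (my illustration, not the CAPS code): the BFS/DFS choice decides whether the 7 recursive subproblems run concurrently on P/7 processors each or one after another on all P; here they simply run sequentially.

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff or n % 2:          # fall back on small or odd sizes
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h,:h], A[:h,h:], A[h:,:h], A[h:,h:]
        B11, B12, B21, B22 = B[:h,:h], B[:h,h:], B[h:,:h], B[h:,h:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty((n, n))
        C[:h,:h] = M1 + M4 - M5 + M7
        C[:h,h:] = M3 + M5
        C[h:,:h] = M2 + M4
        C[h:,h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(128, 128); B = np.random.rand(128, 128)
    assert np.allclose(strassen(A, B), A @ B)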

              Strassen-like beyond matmul

              bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

              bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

              Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA (see the sketch below)
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, square case m = k = n = 6144; CARMA vs ScaLAPACK, with peak shown. Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA]

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, inner-product case m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK, with peak shown. Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA]

CARMA Performance: Shared Memory

[Figure: log-linear plot, square case m = k = n; CARMA vs MKL in single and double precision, with single- and double-precision peak lines. Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]

CARMA Performance: Shared Memory

[Figure: log-linear plot, inner-product case m = n = 64; CARMA vs MKL in single and double precision. Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: L3 cache misses for the shared-memory inner product (m = n = 64, k = 524288); 97% fewer misses in one configuration, 86% fewer in another]

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n³)

• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n³/M^{1/3})

• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n³/M^{1/2})

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for QR of a tall-skinny W = [W0; W1; W2; W3].
 Parallel (binary tree): QR each Wi to get R00, R10, R20, R30; combine pairs into R01, R11; combine those into R02.
 Sequential/streaming (flat tree): fold blocks in one at a time, R00 → R01 → R02 → R03.
 Dual core: a hybrid of the two trees.]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core, …
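In code, the parallel (binary-tree) variant reduces to repeatedly QR-factoring stacked R factors. A minimal NumPy sketch (my illustration; real TSQR also keeps the implicit Q factors and picks the tree to fit the machine), assuming a power-of-two number of blocks:

    import numpy as np

    def tsqr_R(W, nblocks=4):
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
        while len(Rs) > 1:                       # one tree level per iteration
            Rs = [np.linalg.qr(np.vstack(pair), mode='r')
                  for pair in zip(Rs[::2], Rs[1::2])]
        return Rs[0]

    W = np.random.rand(1000, 8)
    R_tree = tsqr_R(W)
    R_flat = np.linalg.qr(W, mode='r')
    # R is unique up to the signs of its rows:
    assert np.allclose(np.abs(R_tree), np.abs(R_flat))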

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]; factor each block Wi = Pi·Li·Ui, and choose b pivot rows of each Wi, calling them Wi'

Stack the winners pairwise: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, calling them W12' and W34'

Final round: [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
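A sketch of one tournament for a panel, using SciPy's partial-pivoting LU as the local "choose b pivot rows" step (my illustration; the block count and the power-of-two tree are assumptions):

    import numpy as np
    from scipy.linalg import lu

    def pivot_rows(block, b):
        P, L, U = lu(block)                      # LU with partial pivoting
        order = P.T @ np.arange(block.shape[0])  # permutation as row indices
        return block[order.astype(int)[:b]]      # the b chosen pivot rows

    def tournament(W, b, nblocks=4):
        cands = [pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks)]
        while len(cands) > 1:                    # pairwise play-offs up the tree
            cands = [pivot_rows(np.vstack(pair), b)
                     for pair in zip(cands[::2], cands[1::2])]
        return cands[0]                          # b pivot rows for the whole panel

    W = np.random.rand(64, 4)
    print(tournament(W, b=4))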

Minimizing Communication in TSLU

[Figure: the same reduction trees as in TSQR, with an LU factorization at each node: parallel binary tree over W1..W4, sequential/streaming flat tree, and a dual-core hybrid]

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L − Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting

[Figure: 2D parallel CALU layout and communication pattern]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: 2.5D parallel CALU layout with c = 4 replicas]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup over log2(p) (x-axis) and log2(n²/p) = log2(memory_per_proc) (y-axis); up to 29x]

2.5D vs 2D LU: With and Without Pivoting

[Figure: performance comparison of 2.5D vs 2D LU, with and without pivoting]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL, Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

• Plain recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n³/M^{1/2})
  #Messages = O(n³/M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n³/M^{1/2})
  #Messages = O(n³/M^{3/2})

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
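A small NumPy sketch of this recursion (my illustration, not the parallel code): mp below is the ⊗ of the slide, a min-plus product folded with the old distances, and the result is checked against Floyd-Warshall.

    import numpy as np

    def mp(D, A, B):
        # D = A (x) B: D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0)
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = mp(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = mp(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = mp(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = mp(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = mp(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = mp(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    n = 8
    G = np.random.rand(n, n) * 10
    np.fill_diagonal(G, 0)
    D = G.copy()
    for k in range(n):                       # reference Floyd-Warshall
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    assert np.allclose(dc_apsp(G), D)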

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations read "6.2x speedup" and "2x speedup"]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w³/M^{1/2}), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^{1/2}, iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω(min(d·n/P^{1/2}, d²·n/P))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d²·n/(P·M^{1/2}))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^{1/2}
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^{1/2} cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: sequence of slide builds illustrating band reduction by bulge chasing. Notation: b = bandwidth, c = #columns eliminated per sweep, d = #diagonals, with constraint c + d ≤ b. Orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T eliminate c columns at a time; each elimination creates a bulge of d+c diagonals, which is chased down the band in steps 1 through 6]

Conventional vs CA - SBR

[Animation: conventional SBR touches all data 4 times; communication-avoiding SBR touches all data once]

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

      Q^T·A·Q = [ A11  A12 ]
                [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

                    Two Levels                          Memory Hierarchy
                    (#Words, #Messages)                 (#Words, #Messages)
BLAS-3              [FLPR'99][BDLST'13][MKL etc]        [FLPR'99][BDLST'13][MKL etc]
Cholesky            [G'97][AP'00][LAPACK][BDHS'09]      [G'97][AP'00][BDHS'09]
Sym. Indefinite     [BBDDDPSTY'13]                      [BBDDDPSTY'13]
LU                  [G'97][T'97][GDX'11][BDLST'13]      [G'97][T'97][BDLST'13]
QR                  [EG'98][FW'03][DGHL'12][BDLST'13]   [EG'98][FW'03][BDLST'13]
Rank Revealing QR   [BDD'11][DGGX'13]                   –
Sym Eig & SVD       [BDD'11][BDK'13]                    [BDD'11]
Non Sym Eig         [BDD'11]                            [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^{1/2}), #messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]

                    #Words (BW)                   #Messages (L)            Saving factor
BLAS-3              [AGZ'94][MT'99][ScaLAPACK]    [C'69][vGW'97][SD'11]    L: n/P^{1/2}
Cholesky            [ScaLAPACK]                   [T'99][SD'11]            L: n/P^{1/2}
Sym. Indefinite     [BBDDDPSTY'13][ScaLAPACK]     [BBDDDPSTY'13]           L: n/P^{1/2}
LU                  [ScaLAPACK][GDX'11]           [GDX'11][T'99][SD'11]    L: n/P^{1/2}
QR                  [ScaLAPACK][DGHL'12]          [DGHL'12][T'99]          L: n/P^{1/2}
Rank Revealing QR   [BDD'11][DGGX'13]             –                        –
Sym Eig & SVD       [BDD'11][BDK'13][ScaLAPACK]   [BDD'11][BDK'13]         L: n/P^{1/2}
Non-Sym Eig         [BDD'11]                      [BDD'11]                 BW: P^{1/2}, L: n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix]

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Figure: zoom showing the 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profile, in Mflops; the reference (unblocked) implementation vs the best block size, 4x2]

Register Profile: Itanium 2

[Figure: performance over all register block sizes, from 190 Mflops up to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: four register profiles, with fraction of peak and Mflops ranges –
 Power3 (17% of peak): 122 to 252 Mflops;
 Power4 (16%): 459 to 820 Mflops;
 Itanium 1 (8%): 107 to 247 Mflops;
 Itanium 2 (33%): 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16614
• nnz = 1.1 M

[Figure: spy plot of the matrix]

Zoom in to top corner

• More complicated non-zero structure in general
• n = 16614
• nnz = 1.1 M

[Figure: zoom on the top corner of the matrix]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5² = 2.25x higher
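The payoff of register blocking can be sketched with SciPy's Block CSR (BSR) format (my illustration; OSKI's actual heuristics measure Mflop rates, this only shows the format and the fill ratio):

    import numpy as np
    from scipy.sparse import random, bsr_matrix

    A_csr = random(999, 999, density=1e-3, format='csr', random_state=0)
    A_bsr = bsr_matrix(A_csr, blocksize=(3, 3))   # explicit-zero fill happens here
    x = np.ones(999)
    assert np.allclose(A_csr @ x, A_bsr @ x)      # same result, blocked storage
    fill_ratio = A_bsr.nnz / A_csr.nnz            # stored entries incl. explicit zeros
    print(fill_ratio)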

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the cavity matrix]

[Figure: 100x100 submatrix along diagonal]

Post-RCM Reordering

[Figure: spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering

[Figure: spy plots before (green + red) and after (green + blue) reordering]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm listing: standard CG; the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm listing: s-step CG; the s SpMVs per outer iteration are computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and local computations within the inner loop require no communication]
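For reference, here is a plain NumPy CG (my sketch, not the slide's listing): each iteration does one SpMV and two dot products, i.e. one neighbor exchange plus two global reductions per step in parallel. CA-CG instead replaces s such iterations with one matrix powers kernel, which computes [p, Ap, …, A^s p], and one global reduction for the Gram matrix.

    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        x = np.zeros_like(b)
        r = b.copy(); p = r.copy()
        rr = r @ r
        for _ in range(maxit):
            Ap = A @ p                     # SpMV: neighbor communication
            alpha = rr / (p @ Ap)          # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                 # dot product: global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    def powers(A, v, s):                   # monomial-basis matrix powers kernel
        V = [v]
        for _ in range(s):
            V.append(A @ V[-1])
        return np.column_stack(V)          # [v, Av, ..., A^s v]

    n = 100                                # usage on a small SPD system
    A = (np.diag(2 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
         + np.diag(-np.ones(n - 1), -1))
    b = np.ones(n)
    x = cg(A, b)
    assert np.allclose(A @ x, b, atol=1e-8)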

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400): CA-CG converges more slowly due to roundoff and loses accuracy, relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as sum of S^k and low-rank matrices

                                      Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit (o(nnz)):  Graph Laplacian             Stencils

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two histograms of dot-product error for vectors of size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3 or 4 threads. Left: absolute error (= maximum − minimum) for random vectors, which have the same magnitude but opposite signs. Right: relative error (= absolute error / maximum absolute value) for orthogonal vectors; even the sign is not reproducible]

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
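A toy single-bin version of the prerounding idea (my illustration; the published algorithm uses several bins and handles overflow and exceptions): pre-round every summand to a common power-of-two quantum, so each addition is exact and the sum no longer depends on the order of summation.

    import math, random

    def reproducible_sum(xs, bits=30):
        m = max(abs(x) for x in xs)                 # assumes some nonzero input
        q = 2.0 ** (math.floor(math.log2(m)) - bits)   # shared quantum
        return sum(round(x / q) for x in xs) * q    # integer adds: exact, order-free

    xs = [random.uniform(-1, 1) for _ in range(10**5)]
    s1 = reproducible_sum(xs)
    random.shuffle(xs)
    s2 = reproducible_sum(xs)
    assert s1 == s2        # bitwise identical under any permutation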

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers!)


Lower bound for all "n^3-like" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., computing A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

• Let M = "fast" memory size (per processor)

  words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
  messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))

• Parallel case: assume either load or memory balanced

SIAM SIAG/Linear Algebra Prize, 2012 (Ballard, D., Holtz, Schwartz)

Limits to parallel scaling (1/2)

• Consider the dense case: #flops_per_proc = n^3/P
  – Words = Ω(n^3 / (P·M^(1/2)))
  – Messages = Ω(n^3 / (P·M^(3/2)))
• What is M? Must be at least n^2/P to hold data
  – Words = Ω(n^2 / P^(1/2))
  – Messages = Ω(P^(1/2))
• But if M fixed, looks like perfect strong scaling in time
  – #Flops, Words, Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?

Limits to parallel scaling (2/2)

• Consider the dense case: #flops_per_proc = n^3/P
  – Words = Ω(n^3 / (P·M^(1/2)))
  – Messages = Ω(n^3 / (P·M^(3/2)))
• How big can we make P? and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise, no communication may be needed
• Thm: Words = Ω(n^2 / P^(2/3)), independent of M
  – Reached when M = n^2/P^(2/3), or P = n^3/M^(3/2), and Messages = Ω(1) (log P in practice)
• Attained by 2.5D algorithm when c = P^(1/3) ("3D alg")
• Can keep increasing P until P = n^3; then Words = Messages = Ω(1) (log n in practice)

                Can we attain these lower bounds

                bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

                bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

                new ways to encode answers new data structures

                ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

                ndash Algorithms Energy Heterogeneous Processors hellip11

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

2.5D Matrix Multiplication

• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid with dimensions (P/c)^(1/2) x (P/c)^(1/2) x c. Example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, with axes i, j, k

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along k-axis, so P(i,j,0) owns C(i,j) (serial sketch below)
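The three steps can be prototyped serially. Below is a minimal numpy simulation, an illustration under simplifying assumptions (square n, grid side s = (P/c)^(1/2) divisible by c and dividing n), not the BG/P implementation:

    import numpy as np

    def matmul_25d_sim(A, B, P=32, c=2):
        """Serial simulation of the 2.5D algorithm: an s x s x c grid,
        where each of the c layers computes 1/c of the SUMMA sum."""
        s = int(np.sqrt(P // c))          # grid side; assumes c divides s
        n = A.shape[0]
        b = n // s                        # block size; assumes s divides n
        C = np.zeros((n, n))
        for k in range(c):                # replication layers (parallel on a real machine)
            for i in range(s):
                for j in range(s):
                    # layer k sums over its 1/c-th of the m index (step 2)
                    for m in range(k * s // c, (k + 1) * s // c):
                        C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                            A[i*b:(i+1)*b, m*b:(m+1)*b] @ B[m*b:(m+1)*b, j*b:(j+1)*b])
        # step 3 (sum-reduce along k) is implicit: all layers accumulate into C
        return C

    n = 64
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_25d_sim(A, B), A @ B)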

2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

[Figure: strong-scaling plot; annotations: 12x faster, 2.7x faster]

Distinguished Paper Award, EuroPar'11 (Solomonik, D.)
SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2))] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2))] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n^3/(cP) · [γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2))] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2))] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as (worked example below)
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
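The timing model is easy to evaluate numerically. The sketch below uses made-up machine constants, purely for illustration, to check the perfect-strong-scaling identity T(cP) = T(P)/c when memory per processor is held at M:

    # Hypothetical machine constants (illustrative only)
    gamma_T = 1e-11   # secs per flop
    beta_T  = 1e-9    # secs per word moved
    alpha_T = 1e-6    # secs per message
    m       = 1e4     # words per message

    def T(P, n, M):
        """T(P) = n^3/P * [gamma + beta/M^(1/2) + alpha/(m*M^(1/2))]."""
        return n**3 / P * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

    n, M = 10**4, 10**6           # problem size, fixed memory per processor
    P = 3 * n**2 // M             # minimal processor count: P*M = 3n^2
    for c in (1, 2, 4, 8):        # adding processors (with their memory)...
        print(c * P, T(c * P, n, M), T(P, n, M) / c)   # ...scales time by 1/c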

Handling Heterogeneity

• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2)] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n^3, minimizing T = max_i T_i
  – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j) (worked example below)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms…
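A quick sketch of that optimal split, with hypothetical per-processor constants chosen only to exercise the formula: each processor's flop share is proportional to 1/ξ_i, which equalizes the per-processor times T_i.

    import numpy as np

    n = 4096
    gamma = np.array([1e-11, 2e-11, 5e-11])   # sec/flop, 3 heterogeneous procs
    beta  = np.array([1e-9, 1e-9, 2e-9])      # sec/word
    alpha = np.array([1e-6, 2e-6, 1e-6])      # sec/message
    M     = np.array([1e6, 5e5, 1e6])         # memory per proc

    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5   # effective sec/flop
    F  = n**3 * (1 / xi) / np.sum(1 / xi)             # optimal flop assignment
    T_i = F * xi                                      # per-processor times...
    print(T_i)                                        # ...all equal n^3 / sum(1/xi)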

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_mn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: contraction C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_mn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(#flops / M^(log_mp(q) − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
  words_moved = Ω(M·(n/M^(1/2))^3 / P)
Strassen's O(n^lg7) matmul:
  words_moved = Ω(M·(n/M^(1/2))^lg7 / P)
Strassen-like O(n^ω) matmul:
  words_moved = Ω(M·(n/M^(1/2))^ω / P)

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

Best way to interleave BFS and DFS is a tuning parameter (serial sketch below)
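For reference, a minimal serial Strassen recursion in Python — an illustrative sketch, not the CAPS implementation. In CAPS, each recursion level would run the seven products either as a BFS step (in parallel, on P/7 processors each) or a DFS step (sequentially, on all P processors), depending on available memory.

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Strassen's recursion; each level spawns 7 half-sized multiplies.
        CAPS chooses, per level, whether the 7 run in parallel (BFS)
        or one after another (DFS)."""
        n = A.shape[0]
        if n <= cutoff:                      # base case: classical matmul
            return A @ B
        h = n // 2                           # assumes n is a power of 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22)  # the 7 multiplies: BFS or DFS here
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(strassen(A, B), A @ B)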

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

[Figure: strong-scaling plot; speedups 24%–184% over previous Strassen-based algorithms]

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n^2·log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

                Cache and Network Oblivious Algorithms

                bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

                bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

                bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

                dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
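A serial sketch of CARMA's recursion (illustrative only; the real CARMA also decides BFS vs DFS per level): always split the largest of the three dimensions m, k, n, which is what keeps communication low even for very rectangular shapes, such as the inner-product case below.

    import numpy as np

    def carma(A, B, cutoff=64):
        """Recursive classical matmul: split the largest of m, k, n in half."""
        m, k = A.shape
        n = B.shape[1]
        if max(m, k, n) <= cutoff:
            return A @ B
        if m >= k and m >= n:               # split rows of A
            return np.vstack([carma(A[:m//2], B), carma(A[m//2:], B)])
        if n >= k:                          # split columns of B
            return np.hstack([carma(A, B[:, :n//2]), carma(A, B[:, n//2:])])
        # split shared dimension k: two subproblems whose results are summed
        return carma(A[:, :k//2], B[:k//2]) + carma(A[:, k//2:], B[k//2:])

    A = np.random.rand(8, 4096)             # "inner-product-shaped" problem
    B = np.random.rand(4096, 8)
    assert np.allclose(carma(A, B), A @ B)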

CARMA Performance: Distributed Memory

[Figure: log-log performance plot, Square case: m = k = n = 6144; CARMA vs ScaLAPACK vs Peak]

Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Distributed Memory

[Figure: log-log performance plot, Inner Product case: m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK vs Peak]

Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: performance plot (log x-axis, linear y-axis), Square case: m = k = n; CARMA vs MKL, single and double precision, with single/double peak lines]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: performance plot (log x-axis, linear y-axis), Inner Product case: m = n = 64; CARMA vs MKL, single and double precision]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: L3 cache-miss counts, Shared Memory Inner Product (m = n = 64, k = 524,288); 97% fewer misses and 86% fewer misses than MKL]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  – #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – #words_moved = O(n^3/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3] (tall-skinny matrix, blocked by rows)

• Parallel (binary tree): factor each block, Wi → R_i0; then stack and re-factor pairs: [R00; R10] → R01, [R20; R30] → R11; finally [R01; R11] → R02
• Sequential/Streaming (flat tree): W0 → R00; then fold in one block at a time: [R00; W1] → R01, [R01; W2] → R02, [R02; W3] → R03
• Dual Core: a hybrid of the two trees

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core (numpy sketch below)
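A binary-reduction-tree TSQR sketch in numpy, illustrative only (real TSQR also assembles the implicit Q factor; here nblocks is assumed to be a power of 2):

    import numpy as np

    def tsqr_R(W, nblocks=4):
        """Binary-tree TSQR: return the R factor of tall-skinny W.
        Each level QR-factors stacked pairs of R's from the level below."""
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
        while len(Rs) > 1:                   # combine pairs up the tree
            Rs = [np.linalg.qr(np.vstack(pair), mode='r')
                  for pair in zip(Rs[::2], Rs[1::2])]
        return Rs[0]

    W = np.random.rand(4096, 32)             # tall and skinny
    R = tsqr_R(W)
    # R agrees with a direct QR up to the signs of its rows
    R_ref = np.linalg.qr(W, mode='r')
    assert np.allclose(np.abs(R), np.abs(R_ref))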

Back to LU: using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

• Factor each block: Wi = Pi·Li·Ui; choose b pivot rows of each Wi, call them Wi'
• Stack and factor the winners: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, W12' and W34'
• Stack and factor: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows
• Go back to W and use these b pivot rows (move them to top, do LU without pivoting) — toy sketch below
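A toy sketch of the tournament (illustrative; each contest here selects b candidate rows via ordinary GEPP on the stacked candidates, and winners play off up a binary tree — nblocks assumed a power of 2):

    import numpy as np

    def gepp_pivot_rows(W):
        """Row indices of W chosen as pivots by LU with partial pivoting
        (one pivot per column of the b-column panel)."""
        A = W.astype(float).copy()
        rows = np.arange(A.shape[0])
        b = A.shape[1]
        for j in range(b):
            p = j + np.argmax(np.abs(A[j:, j]))      # partial pivoting
            A[[j, p]], rows[[j, p]] = A[[p, j]], rows[[p, j]]
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
        return rows[:b]

    def tournament_pivots(W, nblocks=4):
        """TSLU-style tournament: local GEPP picks b candidates per block,
        then winners are stacked pairwise and re-contested up the tree."""
        blocks = np.array_split(np.arange(W.shape[0]), nblocks)
        cand = [idx[gepp_pivot_rows(W[idx])] for idx in blocks]   # local rounds
        while len(cand) > 1:
            merged = [np.concatenate(p) for p in zip(cand[::2], cand[1::2])]
            cand = [rows[gepp_pivot_rows(W[rows])] for rows in merged]
        return cand[0]    # b global pivot rows: move to top, LU w/o pivoting

    W = np.random.rand(1024, 8)
    print(tournament_pivots(W))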

Minimizing Communication in TSLU

W = [W1; W2; W3; W4]

• Parallel (binary tree): LU each block; combine pairs of candidate pivot-row sets with LU; combine again
• Sequential/Streaming (flat tree): LU the first block, then fold in one block at a time
• Dual Core: a hybrid of the two trees

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L − Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting

[Figure: 2D CALU block layout]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: 2.5D CALU replicated layout]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: speedup heatmap over log2(p) (x-axis) and log2(n^2/p) = log2(memory_per_proc) (y-axis); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

[Figure: performance comparison]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·P^T = L·T·L^T, where T is banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" or SMLU

Recursive GEPP (columnwise layout only):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

Shape Morphing LU (SMLU):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring (numpy sketch below)
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

(Here ⊗ includes the min with the existing entries of the output, per the abbreviation above.)
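The semiring product is easy to express with numpy broadcasting. The sketch below (illustrative only) implements the min-plus product that replaces (+, x) in 2.5D matmul, plus plain Floyd-Warshall for comparison:

    import numpy as np

    def minplus(D, A, B):
        """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the semiring 'matmul'."""
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def floyd_warshall(A):
        D = A.copy()
        for k in range(len(D)):
            D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])  # rank-1 min-plus update
        return D

    n = 16
    A = np.random.rand(n, n); np.fill_diagonal(A, 0)
    D1 = floyd_warshall(A)
    # iterating the semiring product to convergence gives the same shortest paths
    D2 = A.copy()
    for _ in range(int(np.ceil(np.log2(n))) + 1):
        D2 = minplus(D2, D2, D2)
    assert np.allclose(D1, D2)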

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: the algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω(min(d·n/P^(1/2), d^2·n/P))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence (slides 58-68): bulge chasing on a band of width b+1. Q1 zeroes c columns below diagonal d+1, creating a (d+c) x (d+c) bulge; applying Q1^T from the right, then Q2, Q3, Q4, Q5, … (with their transposes), chases successive bulges (steps 1-6) down the band until the bandwidth is reduced.]

Conventional vs CA-SBR

  Conventional:             touch all data 4 times
  Communication-Avoiding:   touch all data once

[Figure: sweep patterns of conventional vs communication-avoiding band reduction]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words / Messages, for Two Levels of memory and for a full Memory Hierarchy

• BLAS-3 — [FLPR'99][BDLST'13][MKL etc.] (words and messages, both models)
• Cholesky — [G'97][AP'00][LAPACK][BDHS'09]; hierarchy: [G'97][AP'00][BDHS'09]
• Sym. Indefinite — [BBDDDPSTY'13]
• LU — Words: [G'97][T'97][GDX'11][BDLST'13]; Messages: [GDX'11][BDLST'13]; hierarchy: [G'97][T'97][BDLST'13], [BDLST'13]
• QR — Words: [EG'98][FW'03][DGHL'12][BDLST'13]; Messages: [FW'03][DGHL'12][BDLST'13]; hierarchy: [EG'98][FW'03][BDLST'13], [FW'03][BDLST'13]
• Rank Revealing QR — [BDD'11][DGGX'13]
• Sym. Eig & SVD — [BDD'11][BDK'13]; hierarchy: [BDD'11]
• Non-Sym. Eig — [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Last column: saving factor attainable with extra memory (2.5D, M = Ω(c·n^2/P))

• BLAS-3 — [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; 2.5D saving: L: n/P^(1/2)
• Cholesky — [ScaLAPACK][T'99][SD'11]; 2.5D saving: L: n/P^(1/2)
• Sym. Indefinite — Words: [BBDDDPSTY'13][ScaLAPACK]; Messages: [BBDDDPSTY'13]; 2.5D saving: L: n/P^(1/2)
• LU — Words: [ScaLAPACK][GDX'11][T'99][SD'11]; Messages: [GDX'11][T'99][SD'11]; 2.5D saving: L: n/P^(1/2)
• QR — Words: [ScaLAPACK][DGHL'12][T'99]; Messages: [DGHL'12][T'99]; 2.5D saving: L: n/P^(1/2)
• Rank Revealing QR — [BDD'11][DGGX'13]
• Sym. Eig & SVD — Words: [BDD'11][BDK'13][ScaLAPACK]; Messages: [BDD'11][BDK'13]; 2.5D saving: L: n/P^(1/2)
• Non-Sym. Eig — [BDD'11]; 2.5D saving: BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector (basis sketch below)
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
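The kernel underlying these methods computes the Krylov basis [x, Ax, A^2x, …, A^k x]. A naive version (below, illustrative only) does k separate SpMVs, which is exactly the O(k) data movement that the communication-avoiding matrix powers kernel reorganizes into one pass over A:

    import numpy as np
    import scipy.sparse as sp

    def krylov_basis(A, x, k):
        """Naive matrix powers kernel: k SpMVs, hence k passes over A.
        The CA version computes the same basis in O(1) passes."""
        V = np.empty((k + 1, len(x)))
        V[0] = x
        for j in range(k):
            V[j + 1] = A @ V[j]      # one SpMV = one round of communication
        return V

    n, k = 1000, 8
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson
    x = np.random.rand(n)
    V = krylov_basis(A, x, k)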

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix]

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Figure: zoomed spy plot showing 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile in Mflops, with the reference implementation and the best block size (4x2) marked]

Register Profile: Itanium 2

[Figure: heatmap over register block sizes, ranging from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: register-profile heatmaps for four platforms — Power3 (best/reference 1.7x), Power4 (1.6x), Itanium 1 (8x), Itanium 2 (3.3x); per-platform rates include 122 → 252 Mflops, 459 → 820 Mflops, 107 → 247 Mflops, and 190 Mflops → 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

[Figure: spy plot]

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

[Figure: zoomed spy plot]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (measurement sketch below)
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher
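The fill/speed tradeoff is easy to measure with scipy's block-sparse format. A small sketch, illustrative only, using a random sparse matrix rather than the slide's accelerator matrix:

    import scipy.sparse as sp

    # Random sparse matrix standing in for the slide's example
    A = sp.random(3000, 3000, density=0.002, format='csr', random_state=0)
    A = (A + sp.eye(3000)).tocsr()

    B = A.tobsr(blocksize=(3, 3))   # 3x3 register blocking, zeros filled in
    fill_ratio = B.nnz / A.nnz      # stored entries (incl. explicit zeros) / true nnz
    print(f"fill ratio = {fill_ratio:.2f}")
    # blocking pays off when the mflop-rate gain exceeds the fill ratio,
    # e.g. 2.25x rate vs 1.5x work on the slide's Pentium III example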

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot]

100x100 Submatrix Along Diagonal

[Figure: zoomed spy plot]

Post-RCM Reordering

[Figure: spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before: green + red; after: green + blue]

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm figure: SpMVs and dot products require communication in each iteration — numpy sketch below]

Example: CA-Conjugate Gradient

[Algorithm figure: s steps of basis generation via the CA Matrix Powers Kernel, then one global reduction to compute the Gram matrix G; local computations within the inner loop require no communication]
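For concreteness, a plain numpy CG sketch (standard textbook form, not the slide's exact listing), with the per-iteration communication points marked: one SpMV plus two dot products, each dot product a global reduction in a parallel setting.

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-10, maxiter=2000):
        """Classical CG. Per iteration: 1 SpMV + 2 dot products."""
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rr = r @ r                      # dot product (global reduction)
        for _ in range(maxiter):
            Ap = A @ p                  # SpMV (neighbor communication)
            alpha = rr / (p @ Ap)       # dot product (global reduction)
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r              # dot product (global reduction)
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 200
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # SPD model problem
    b = np.random.rand(n)
    x = cg(A, b)
    assert np.linalg.norm(A @ x - b) < 1e-6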

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem — 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400 — down to machine precision. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

  Nonzero entries \ Indices   | Explicit (O(nnz))    | Implicit (o(nnz))
  Explicit (O(nnz))           | CSR and variations   | Vision, climate, AMR, …
  Implicit (o(nnz))           | Graph Laplacian      | Stencils

• Matrix could be a sum of "sparse" matrices (sketch below)
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
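A sketch of why the S + U·D·V^T representation matters (illustrative shapes): applying such a matrix, or its powers, never forms the dense sum, so A stays "sparse" in the o(n^2)-data sense.

    import numpy as np
    import scipy.sparse as sp

    n, r = 2000, 5
    S = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
    U, V = np.random.rand(n, r), np.random.rand(n, r)
    D = np.diag(np.random.rand(r))           # small & square

    def apply_A(x):
        """y = (S + U D V^T) x, without ever forming the dense matrix."""
        return S @ x + U @ (D @ (V.T @ x))   # O(nnz + nr) work, o(n^2) data

    x = np.random.rand(n)
    y = apply_A(apply_A(x))                  # A^2 x, still never densifying A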

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign is not reproducible]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands (demo below)
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee a fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get the exact answer (fails 2)
  – Prerounding technique (Nguyen, D.)
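The root cause is easy to demonstrate. The snippet below is a standard illustration (not from the slides) of floating-point addition failing associativity, so different reduction trees give different bits:

    # Floating-point addition is not associative:
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- different order, different answer

    # So a tree-order (e.g., multithreaded) sum can differ from the serial sum:
    xs = [1e16, 1.0, -1e16, 1.0]
    serial = ((xs[0] + xs[1]) + xs[2]) + xs[3]     # 1.0
    tree   = (xs[0] + xs[1]) + (xs[2] + xs[3])     # 0.0
    print(serial, tree)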

Performance results on 1024 procs of a Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

[Figure: performance plot]

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                  Limits to parallel scaling (12)bull Consider dense case flops_per_proc = n3P

                  ndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

                  bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

                  bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

                  bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

                  bull How big can we make P and M

                  Limits to parallel scaling (22)

                  bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

                  bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

                  ndash Otherwise no communication may be neededbull Thm Words= (n2P23 ) independent of M

                  bull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

                  Can we attain these lower bounds

                  bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

                  bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

                  new ways to encode answers new data structures

                  ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

                  ndash Algorithms Energy Heterogeneous Processors hellip11

                  Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                  ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                  ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                  bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                  bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                  25D Matrix Multiplication

                  bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                  c

                  (Pc)12

                  (Pc)12

                  Example P = 32 c = 2

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^{1/2} × (P/c)^{1/2} × c grid (axes i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^{1/2} × n(c/P)^{1/2}

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along k-axis so that P(i,j,0) owns C(i,j)
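The three steps map directly onto code. Below is a minimal serial numpy sketch of the 2.5D schedule (an illustration only, not the BG/P implementation): each of the c replication layers computes its 1/c-th of the k-summation, and the accumulation plays the role of the sum-reduction along the k-axis.

```python
import numpy as np

def matmul_25d_sim(A, B, c):
    """Each of the c layers does 1/c-th of the k-summation (step 2);
    the += plays the role of the sum-reduce along the k-axis (step 3)."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for k in range(c):                      # one pass per replication layer
        lo, hi = k * n // c, (k + 1) * n // c
        C += A[:, lo:hi] @ B[lo:hi, :]      # stands in for 1/c-th of SUMMA
    return C

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(matmul_25d_sim(A, B, c=2), A @ B)
```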

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Performance plot; annotations: “12x faster”, “2.7x faster” vs the 2D algorithm]

Distinguished Paper Award, EuroPar’11 (Solomonik, D.)
SC’11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c

• Notation for timing model:
 – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [γ_T + β_T/M^{1/2} + α_T/(m·M^{1/2})] = T(P)/c

• Notation for energy model:
 – γ_E, β_E, α_E = joules for same operations
 – δ_E = joules per word of memory used per sec
 – ε_E = joules per sec, for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [γ_E + β_E/M^{1/2} + α_E/(m·M^{1/2})] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)

• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n^3/(cP) · [γ_T + β_T/M^{1/2} + α_T/(m·M^{1/2})] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [γ_E + β_E/M^{1/2} + α_E/(m·M^{1/2})] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)

• Can use these formulas to answer many questions, such as
 – How to choose p and M to minimize energy E needed for computation?
 – Given max allowed runtime T, what is minimum energy E needed to achieve it?
 – Given max allowed energy E, what is the minimum runtime T attainable?
 – Can we minimize the average power P = E/T?
 – Given target energy efficiency, what architectural parameters are needed to achieve it?
  • Can we attain 75 Gflops/Watt?
  • Can we attain an exaflop for 20 MWatts?
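As a worked example, here is a small Python calculator for the two formulas above; every parameter value below is invented purely for illustration. It confirms the scaling claim: quadrupling P at fixed per-processor memory M cuts the modeled time by 4x and leaves the modeled energy unchanged.

```python
def T(P, M, n, m, gT, bT, aT):
    # T(P) = n^3/P * [gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2))]
    return (n**3 / P) * (gT + bT / M**0.5 + aT / (m * M**0.5))

def E(P, M, n, m, gE, bE, aE, dE, eE, t):
    # per-processor energy (flops + memory-held + leakage terms), times P
    return P * ((n**3 / P) * (gE + bE / M**0.5 + aE / (m * M**0.5))
                + dE * M * t + eE * t)

n, m, M = 2**14, 1024.0, 2**20          # made-up problem/machine sizes
args = (n, m, 1e-11, 1e-9, 1e-6)        # made-up gamma_T, beta_T, alpha_T
t1, t4 = T(64, M, *args), T(256, M, *args)
print(t4 / t1)                          # -> 0.25: time scales like 1/P
print(E(256, M, n, m, 1e-9, 1e-8, 1e-5, 1e-12, 1e-3, t4)
      / E(64, M, n, m, 1e-9, 1e-8, 1e-5, 1e-12, 1e-3, t1))  # -> 1.0
```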

Handling Heterogeneity

• Suppose each of P processors could differ
 – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory

• What is optimal assignment of work F_i to minimize time?
 – T_i = F_i·γ_i + F_i·β_i/M_i^{1/2} + F_i·α_i/M_i^{3/2} = F_i·[γ_i + β_i/M_i^{1/2} + α_i/M_i^{3/2}] = F_i·ξ_i
 – Choose F_i so Σ_i F_i = n^3, minimizing T = max_i T_i
 – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j), and T = n^3/Σ_j(1/ξ_j)

• Optimal Algorithm for n×n matmul (see the sketch below)
 – Recursively divide into 8 half-sized subproblems
 – Assign subproblems to processor i to add up to F_i flops

• Works for Strassen, other algorithms…
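A few lines of Python make the work-split formula concrete (the machine parameters below are made up). Note that with this split every processor finishes at the same time T, which is why the maximum is minimized.

```python
def optimal_split(n, gammas, betas, alphas, mems):
    """F_i = n^3 * (1/xi_i) / sum_j(1/xi_j);  T = n^3 / sum_j(1/xi_j)."""
    xi = [g + b / M**0.5 + a / M**1.5            # time per flop on proc i
          for g, b, a, M in zip(gammas, betas, alphas, mems)]
    inv = [1.0 / x for x in xi]
    F = [n**3 * iv / sum(inv) for iv in inv]     # flops assigned to proc i
    T = n**3 / sum(inv)                          # common finish time
    return F, T

F, T = optimal_split(1024, gammas=[1e-9, 4e-9], betas=[1e-8, 1e-8],
                     alphas=[1e-6, 2e-6], mems=[2**20, 2**18])
print(F, T)   # the faster processor gets proportionally more flops
```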

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
 – Communication lower bounds apply

• Complex symmetries possible
 – Ex: B(m,n,k) = B(k,m,n) = …
 – d-fold symmetry can save up to d-fold flops/memory

• Heavily used in electronic structure calculations
 – Ex: NWChem

• CTF: Cyclops Tensor Framework
 – Exploits 2.5D algorithms, symmetries
 – Solomonik, Hammond, Matthews

[Figure: layout for C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
 – Communication lower bounds apply

• Complex symmetries possible
 – Ex: B(m,n,k) = B(k,m,n) = …
 – d-fold symmetry can save up to d-fold flops/memory

• Heavily used in electronic structure calculations
 – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn

• CTF: Cyclops Tensor Framework
 – Exploits 2.5D algorithms, symmetries
 – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
 – Solomonik, Hammond, Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
 – Strassen-like: DAG must be “regular” and connected
• Extends up to M = n^2/P^{2/ω}
• Extends to rectangular case: multiply (m×n)·(n×p) in q mults
 – words_moved = Ω(flops/M^{log_{mp} q − 1})

• Best Paper Prize (SPAA’11): Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul: words_moved = Ω(M·(n/M^{1/2})^3/P)
Strassen’s O(n^{lg 7}) matmul: words_moved = Ω(M·(n/M^{1/2})^{lg 7}/P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^{1/2})^ω/P)

BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

Communication-Avoiding Parallel Strassen (CAPS)

CAPS: if EnoughMemory and P ≥ 7
   then BFS step
   else DFS step
 end if

The best way to interleave BFS and DFS steps is a tuning parameter (see the sketch below).

                  26
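A shared-memory Python sketch of the CAPS recursion follows. Numpy stands in for the distributed machinery and the memory test is reduced to a boolean flag, so this illustrates only the BFS/DFS control structure around the seven Strassen products, not the communication behavior.

```python
import numpy as np

def strassen_caps(A, B, P, mem_ok):
    n = A.shape[0]
    if n <= 64 or P < 7:
        return A @ B                       # base case
    h = n // 2
    A11, A12, A21, A22 = A[:h,:h], A[:h,h:], A[h:,:h], A[h:,h:]
    B11, B12, B21, B22 = B[:h,:h], B[:h,h:], B[h:,:h], B[h:,h:]
    sub = [(A11+A22, B11+B22), (A21+A22, B11), (A11, B12-B22), (A22, B21-B11),
           (A11+A12, B22), (A21-A11, B11+B12), (A12-A22, B21+B22)]
    if mem_ok:   # BFS step: 7 independent subproblems, P/7 procs each
        M = [strassen_caps(X, Y, P // 7, mem_ok) for X, Y in sub]
    else:        # DFS step: subproblems one at a time, all P procs each
        M = [strassen_caps(X, Y, P, True) for X, Y in sub]
    C = np.empty_like(A)
    C[:h,:h] = M[0]+M[3]-M[4]+M[6]; C[:h,h:] = M[2]+M[4]
    C[h:,:h] = M[1]+M[3];           C[h:,h:] = M[0]-M[1]+M[2]+M[5]
    return C

A = np.random.rand(128, 128); B = np.random.rand(128, 128)
assert np.allclose(strassen_caps(A, B, P=49, mem_ok=True), A @ B)
```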

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz ’07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^{ω+η}) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
 – η > 0 needed to deal with numerical stability
 – Strassen already stable, so η = 0

• Thm: For sequential versions of these algorithms, Words_moved = O(n^{ω+η}/M^{(ω+η)/2 − 1} + n^2 log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
 – Not always: 2.5D Matmul on BG/P was topology-aware

• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory

• CARMA (see the sketch below)
 – Divide-and-conquer classical matmul: divide largest of the 3 dimensions to create two subproblems
 – Choose BFS or DFS to adapt to #processors, available memory

CARMA Performance: Distributed Memory

[Strong-scaling plot, log-log axes: square case m = k = n = 6144; CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper), each node 2 × 12 cores, 4 × NUMA]

CARMA Performance: Distributed Memory

[Strong-scaling plot, log-log axes: inner-product case m = n = 192, k = 6291456; CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper), each node 2 × 12 cores, 4 × NUMA]

CARMA Performance: Shared Memory

[Plot, log-linear axes: square case m = k = n; CARMA vs MKL in single and double precision, against single/double peak. Intel Emerald: 4 × Intel Xeon X7560, 8 cores each, 4 × NUMA]

CARMA Performance: Shared Memory

[Plot, log-linear axes: inner-product case m = n = 64; CARMA vs MKL in single and double precision. Intel Emerald: 4 × Intel Xeon X7560, 8 cores each, 4 × NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Plot, linear axes: shared-memory inner product (m = n = 64, k = 524288); CARMA incurs 97% fewer misses in one configuration and 86% fewer in another]

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
  for i = 1 to n
    update column i
    update trailing matrix
 – words_moved = O(n^3)

• Blocked Approach (LAPACK):
  for i = 1 to n/b
    update block i of b columns
    update trailing matrix
 – words_moved = O(n^3/M^{1/3})

• Recursive Approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A); update right half of A; factor(right half of A)
 – words_moved = O(n^3/M^{1/2})

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees on W = [W0; W1; W2; W3].
 Parallel (binary tree): local QRs give R00, R10, R20, R30; pairwise QRs give R01, R11; a final QR gives R02.
 Sequential/streaming (flat tree): R00, then R01, R02, R03, folding in one block at a time.
 Dual core: a hybrid of the two trees.]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
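Here is a numpy sketch of the parallel (binary-tree) variant, computing only the R factor; a full TSQR also assembles the implicit Q from the tree of local Q factors.

```python
import numpy as np

def tsqr_R(W, p):
    """R factor of tall-skinny W via a binary reduction tree of local QRs."""
    blocks = np.array_split(W, p)                        # W0 .. W_{p-1}
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]   # leaf QRs
    while len(Rs) > 1:                                   # tree levels
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[::2], Rs[1::2])]        # pairwise QRs
    return Rs[0]

W = np.random.rand(4000, 50)
R = tsqr_R(W, p=4)
# agrees with a direct QR up to the sign of each row of R
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(W, mode='r')))
```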

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do “Tournament Pivoting”

W (n×b) = [W1; W2; W3; W4]; factor each block: Wi = Pi·Li·Ui
 – Choose b pivot rows of each Wi, call them Wi′

Stack the winners pairwise: [W1′; W2′] = P12·L12·U12 and [W3′; W4′] = P34·L34·U34
 – Choose b pivot rows of each, call them W12′ and W34′

Final round: [W12′; W34′] = P1234·L1234·U1234
 – Choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

                  37
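The tournament structure is easy to sketch in Python: each block nominates the b rows that ordinary GEPP would pick, and winners play off pairwise up the tree. This is a toy serial illustration; candidate_rows is a hypothetical helper, and a real TSLU runs the rounds on a distributed panel.

```python
import numpy as np

def candidate_rows(W, b):
    """Indices (into W) of the b pivot rows ordinary GEPP would select."""
    W = W.astype(float).copy()
    idx = np.arange(W.shape[0])
    for i in range(b):
        p = i + np.argmax(np.abs(W[i:, i]))          # partial-pivot choice
        W[[i, p]] = W[[p, i]]
        idx[[i, p]] = idx[[p, i]]
        W[i+1:, i:] -= np.outer(W[i+1:, i] / W[i, i], W[i, i:])
    return idx[:b]

def tournament_pivots(W, p, b):
    groups = [blk[candidate_rows(W[blk], b)]          # leaf round
              for blk in np.array_split(np.arange(W.shape[0]), p)]
    while len(groups) > 1:                            # play off pairwise
        nxt = []
        for j in range(0, len(groups), 2):
            pair = np.concatenate(groups[j:j+2])
            nxt.append(pair[candidate_rows(W[pair], b)])
        groups = nxt
    return groups[0]  # b winners: move to top of W, then LU w/o pivoting

W = np.random.rand(1024, 8)
print(tournament_pivots(W, p=4, b=8))
```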

Minimizing Communication in TSLU

[Figure: the same three reduction trees as for TSQR, with LU replacing QR at each node, on W = [W1; W2; W3; W4]: parallel (binary tree), sequential/streaming (flat tree), and dual-core hybrid.]

Can choose reduction tree dynamically, to match architecture, as before

                  38

Making TSLU Numerically Stable

• Details matter
 – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
 – Only tournament pivoting is stable

• “Thm”: New scheme as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A

• Why just a “Thm”?

                  39

Stability of LU using TSLU: CALU

• Empirical testing
 – Both random matrices and “special ones”
 – Both binary tree (BCALU) and flat-tree (FCALU)
 – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
 – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a “Thm”?

• Proof is correct – in exact arithmetic
• Experiment
 – Generate 100 random 6×6, rank-3 matrices in Matlab
 – [L,U,P] = lu(A); then do LU without pivoting on P·A and compare L factors: are they the same?
 – Compute || L − Lnp ||: a few 0’s, a few ∞’s, a few NaNs; the rest mostly O(1)
 – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
 – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
 – Same experiment with 20×20 rank-4 matrices: || L − Lnp || often O(10^3)

• Much harder to break TSLU, but possible
 – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
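The Matlab experiment translates directly to Python (a sketch; scipy's lu plays the role of Matlab's, and the no-pivoting elimination is written out by hand):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(0)
diffs = []
with np.errstate(divide='ignore', invalid='ignore'):
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, U = sla.lu(A)                 # GEPP: A = P @ L @ U
        W = P.T @ A                         # rows pre-permuted in GEPP's order
        L2 = np.eye(6)
        for i in range(5):                  # LU *without* pivoting on P^T A
            L2[i+1:, i] = W[i+1:, i] / W[i, i]   # 0/0 -> NaN, x/0 -> inf
            W[i+1:, i:] -= np.outer(L2[i+1:, i], W[i, i:])
        diffs.append(np.max(np.abs(L - L2)))
print(np.array(diffs))   # a few 0s, infs, NaNs; the rest mostly O(1)
```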

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

                  42

2D CALU with Tournament Pivoting

[Figure: 2D parallel CALU schedule]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: 2.5D parallel CALU schedule]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

[Performance plot]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
 – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D “simple”
  • Save half the flops, preserve inertia
 – Usual approach: Bunch-Kaufman
  • D block diagonal with 1×1 and 2×2 blocks
  • Pivot search down column, along row (lots of communication)
 – Alternative: Aasen
  • D = tridiagonal = T
  • Two steps: P·A·P^T = L·T·L^T where T is banded, using TSLU; then solve/factor the narrow band problem with T
  • Up to 2.8x faster than MKL; Best Paper at IPDPS’13

[Figure: band structure of T]

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
 – So far, could not do partial pivoting and minimize #messages, just #words
 – Challenge:
  • Column layout good for choosing pivots, bad for matmul
  • Blocked layout good for matmul, bad for choosing pivots
 – Solution: use both layouts, switching between them
  • “Shape Morphing LU” or SMLU

• Recursive LU (columnwise layout throughout):
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
 – Words = O(n^3/M^{1/2}); Messages = O(n^3/M)

• SMLU (reshape between layouts):
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
 – Words = O(n^3/M^{1/2}); Messages = O(n^3/M^{3/2})

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
 – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
 – Usual approach, like Partial Pivoting:
  • Put longest column first, update rest of matrix, repeat
  • Hard to do using BLAS3 at all, let alone hit the lower bound
 – Use Tournament Pivoting
  • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
  • Thm: This approach “reveals the rank” of A, in the sense that the leading r×r submatrix of R has singular values “near” the largest r singular values of A; ditto for the trailing submatrix
 – Idea extends to other pivoting schemes
  • Cholesky with diagonal pivoting
  • LU with complete pivoting
  • LDL^T with complete pivoting

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can’t reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k)+B(k,j) ) ) by D = A⊗B
 – Dependencies ok, 2.5D works, just a different semiring
• Kleene’s Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12;  D21 = D21 ⊗ D11;  D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21;  D12 = D12 ⊗ D22;  D11 = D12 ⊗ D21
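Here is a runnable numpy sketch of the same recursion, with mp(D, A, B) implementing the ⊗ update above (a serial illustration of the structure that makes 2.5D applicable, not the distributed code):

```python
import numpy as np

def mp(D, A, B):
    """D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) ), i.e. D = A (x) B."""
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(A):
    n = A.shape[0]
    if n == 1:
        return A
    h = n // 2
    D = A.copy()
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = mp(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = mp(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = mp(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = mp(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = mp(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = mp(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

inf = np.inf
A = np.array([[0, 3, inf, 7], [8, 0, 2, inf],
              [5, inf, 0, 1], [2, inf, inf, 0]], dtype=float)
D = A.copy()                          # reference: Floyd-Warshall
for k in range(4):
    D = np.minimum(D, D[:, [k]] + D[[k], :])
assert np.allclose(dc_apsp(A), D)
```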

Performance of 2.5D APSP using Kleene

[Strong-scaling plot on Hopper (Cray XE6, 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan ’79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
 – Need to do dense Cholesky on a w×w submatrix
• Thm: Words_moved = Ω(w^3/M^{1/2}), etc.
• Thm (George ’73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
 – w = n for 2D n×n grid, w = n^2 for 3D n×n×n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
 – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

                  54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; new one?
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^{1/2}, iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω(min( d·n/P^{1/2}, d^2·n/P ))
 – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n/(P·M^{1/2}))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
 – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
 – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
 – (Q·U)’s columns are eigenvectors, Λ holds the eigenvalues
 – Dense → Tridiagonal → Diagonal
 – Only half BLAS3, half BLAS2, in LAPACK’s sytrd

• Communication-Avoiding Approach
 – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^{1/2}
 – Continue as above, starting with B
 – Dense → Banded → Tridiagonal → Diagonal
 – Dense → Banded: use TSQR to zero out M^{1/2} cols/rows at a time
 – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: bulge chasing on a band matrix of bandwidth b+1. Notation: b = bandwidth, c = #columns per block, d = #diagonals eliminated per sweep; constraint: c + d ≤ b. Orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T create and chase bulges 1 through 6 down and off the band, reducing the bandwidth from b+1 to d+1.]

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

Speedups of Sym. Band Reduction vs LAPACK’s DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
 – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
 – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
 – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
 – n = 12000, b = 500, 6 threads

• Neither MKL nor ACML benefits from multithreading in DSBTRD
 – Best sequential speedup vs MKL: 1.9x
 – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
 – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
 – Q^T·A·Q will be block upper triangular:
   [ A11  A12 ]
   [  ε   A22 ]
 – Apply recursively to A11, A22
 – Depends on randomization:
  1. Randomized Rank Revealing QR decomposition
  2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

Algorithm          | Two Levels: Words / Messages                                        | Memory Hierarchy: Words / Messages
BLAS-3             | [FLPR’99][BDLST’13][MKL etc.] (both)                                | [FLPR’99][BDLST’13][MKL etc.] (both)
Cholesky           | [G’97][AP’00][LAPACK][BDHS’09] / [G’97][AP’00][BDHS’09]             | [G’97][AP’00][BDHS’09] (both)
Sym. Indefinite    | [BBDDDPSTY’13] (both)                                               | [BBDDDPSTY’13] (both)
LU                 | [G’97][T’97][GDX’11][BDLST’13] / [GDX’11][BDLST’13]                 | [G’97][T’97][BDLST’13] / [BDLST’13]
QR                 | [EG’98][FW’03][DGHL’12][BDLST’13] / [FW’03][DGHL’12][BDLST’13]      | [EG’98][FW’03][BDLST’13] / [FW’03][BDLST’13]
Rank-Revealing QR  | [BDD’11][DGGX’13]                                                   |
Sym. Eig & SVD     | [BDD’11][BDK’13] / [BDD’11]                                         |
Non-Sym. Eig       | [BDD’11] (both)                                                     |

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^{1/2}), #messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]; last column = saving factor attainable with extra memory (2.5D, M = Θ(c·n^2/P))

Algorithm          | Words (BW) / Messages (L)                                           | Saving factor
BLAS-3             | [AGZ’94][MT’99][ScaLAPACK][C’69][vGW’97][SD’11] (both)              | L: n/P^{1/2}
Cholesky           | [ScaLAPACK][T’99][SD’11] (both)                                     | L: n/P^{1/2}
Sym. Indefinite    | [BBDDDPSTY’13][ScaLAPACK] / [BBDDDPSTY’13]                          | L: n/P^{1/2}
LU                 | [ScaLAPACK][GDX’11][T’99][SD’11] / [GDX’11][T’99][SD’11]            | L: n/P^{1/2}
QR                 | [ScaLAPACK][DGHL’12][T’99] / [DGHL’12][T’99]                        | L: n/P^{1/2}
Rank-Revealing QR  | [BDD’11][DGGX’13]                                                   |
Sym. Eig & SVD     | [BDD’11][BDK’13][ScaLAPACK] / [BDD’11][BDK’13]                      | L: n/P^{1/2}
Non-Sym. Eig       | [BDD’11] / [BDD’11]                                                 | BW: P^{1/2}, L: n

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and a starting vector
 – Many such “Krylov Subspace Methods”
  • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …

• Goal: minimize communication
 – Assume matrix “well-partitioned”
 – Serial implementation
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
 – Parallel implementation on p processors
  • Conventional: O(k log p) messages (k SpMV calls, dot prods)
  • New: O(log p) messages – optimal

• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability

                  75

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix]

                  77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

• 8×8 dense substructure: exploit this to limit #mem_refs

[Figure: zoom of the sparsity pattern showing dense 8×8 blocks]

                  78

Speedups on Itanium 2: The Need for Search

[Heat map of SpMV performance (Mflops) over register block sizes; the reference point and the best point (4×2 blocking) are marked]

                  79

Register Profile: Itanium 2

[Heat map: performance ranges from 190 Mflops (worst) to 1190 Mflops (best)]

                  80

Register Profiles: IBM and Intel IA-64

[Four heat maps of SpMV performance over register block sizes:
 Power3 (best speedup 1.7x): 122–252 Mflops
 Power4 (1.6x): 459–820 Mflops
 Itanium 1 (1.8x): 107–247 Mflops
 Itanium 2 (3.3x): 190 Mflops – 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

[Figure: spy plot of the matrix]

                  82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

[Figure: zoomed spy plot]

                  83

3×3 blocks look natural, but…

• Example: 3×3 blocking
 – Logical grid of 3×3 cells
• But would lead to lots of “fill-in”

Extra Work Can Improve Efficiency!

• Example: 3×3 blocking
 – Logical grid of 3×3 cells
 – Fill in explicit zeros
 – Unroll 3×3 block multiplies
 – “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup!
 – Actual mflop rate is 1.5^2 = 2.25x higher

                  85
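scipy can demonstrate this trade-off directly, since its BSR format stores exactly the register-blocked layout with explicit zeros. A small sketch (the matrix size and density here are arbitrary stand-ins, not the raefsky matrix):

```python
import numpy as np
from scipy.sparse import random as sprandom

A = sprandom(2100, 2100, density=0.003, format='csr', random_state=0)
x = np.random.rand(2100)

B = A.tobsr(blocksize=(3, 3))      # explicit zeros are filled in here
fill_ratio = B.nnz / A.nnz          # extra flops paid for regular 3x3 tiles
y = B @ x                           # tile-by-tile multiply: one row index
                                    # per block instead of per nonzero
assert np.allclose(y, A @ x)
print("fill ratio:", round(fill_ratio, 2))
```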

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the matrix]

[Figure: 100×100 submatrix along the diagonal]

Post-RCM Reordering

[Figure: spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before = green + red, after = green + blue]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…

• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
 – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later …

                  90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user’s matrix & machine
 – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning

• For “advanced” users & solver library writers
 – Available as a stand-alone library
 – Available as a PETSc extension
 – bebop.cs.berkeley.edu/oski

• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

                  91

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                  93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing: each iteration does one SpMV and two dot products; SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm listing: s steps per outer iteration; the s SpMVs are done via the CA matrix powers kernel, and the dot products via one global reduction to compute the Gram matrix G; local computations within the inner loop require no communication]
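For reference, here is the classical CG loop in Python with its communication points marked in comments (a sequential sketch; CA-CG batches these per-iteration communications s iterations at a time):

```python
import numpy as np
from scipy.sparse import diags

def cg(A, b, tol=1e-10, maxit=5000):
    x = np.zeros_like(b)
    r = b.copy(); p = r.copy()
    rr = r @ r                       # dot product -> global reduction
    for _ in range(maxit):
        Ap = A @ p                   # SpMV -> neighbor communication
        alpha = rr / (p @ Ap)        # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r               # dot product -> global reduction
        if rr_new < tol**2:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 400   # 1D Poisson here, a stand-in for the 2D model problem below
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))
```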

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                  96

[Convergence plot: CA-CG with the monomial basis vs classical CG, residual vs iteration, machine precision marked.
 Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400.
 Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

                  97

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

  Indices \ Nonzero entries | Explicit (O(nnz))  | Implicit (o(nnz))
  Explicit (O(nnz))         | CSR and variations | Vision, climate, AMR, …
  Implicit (o(nnz))         | Graph Laplacian    | Stencils

• Matrix could be a sum of “sparse” matrices
 – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a “sparse matrix”?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                  101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm, at GNS-MBH
 – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), who otherwise don’t believe the results
 – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
 – Most: “What?! How will I debug without reproducibility?”
 – Few: “I know better, and do careful error analysis”
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (results of the same magnitude, opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).
 Setup: vector size 10^6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value]

                  103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches
 – Guarantee fixed reduction tree (not 2. or 3.)
 – Use (very) high precision to get exact answer (not 2.)
 – Prerounding technique (Nguyen, D.)
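To make the prerounding idea concrete, here is a heavily simplified one-level Python sketch. The real Nguyen/Demmel algorithm uses several extraction levels and carefully tuned constants; here accuracy is deliberately sacrificed for brevity, and the boundary constant is chosen only to guarantee that the rounded summands add exactly, hence order-independently.

```python
import numpy as np

def reproducible_sum(x):
    n = len(x)
    m = float(np.max(np.abs(x)))
    if m == 0.0:
        return 0.0
    # power-of-2 boundary with sigma > n * max|x_i|
    sigma = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 1)
    q = (x + sigma) - sigma     # x_i rounded to a multiple of ulp(sigma)/2
    return float(np.sum(q))     # all partial sums are exact -> one answer

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
perms = [rng.permutation(10**6) for _ in range(3)]
print({reproducible_sum(x[p]) for p in perms})   # one value: reproducible
print({float(np.sum(x[p])) for p in perms})      # typically several values
```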

                  104

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don’t Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                    Limits to parallel scaling (22)

                    bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

                    bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

                    ndash Otherwise no communication may be neededbull Thm Words= (n2P23 ) independent of M

                    bull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

                    Can we attain these lower bounds

                    bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

                    bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

                    new ways to encode answers new data structures

                    ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

                    ndash Algorithms Energy Heterogeneous Processors hellip11

                    Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                    ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                    ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                    bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                    bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                    25D Matrix Multiplication

                    bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                    c

                    (Pc)12

                    (Pc)12

                    Example P = 32 c = 2

                    25D Matrix Multiplication

                    bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                    k

                    j

                    iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

                    (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

                    (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

                    25D Matmul on BGP 16K nodes 64K coresc = 16 copies

                    Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

                    12x faster

                    27x faster

                    Perfect Strong Scaling ndash in Time and Energy (12)

                    bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

                    bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

                    ndash γT βT αT = secs per flop per word_moved per message of size m

                    bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

                    ndash γE βE αE = joules for same operations

                    ndash δE = joules per word of memory used per sec

                    ndash εE = joules per sec for leakage etc

                    bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

                    Perfect Strong Scaling ndash in Time and Energy (22)

                    bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

                    bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

                    achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

• Suppose each of the P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^{1/2} + F_i·α_i/M_i^{3/2} = F_i·[γ_i + β_i/M_i^{1/2} + α_i/M_i^{3/2}] = F_i·ξ_i
  – Choose the F_i so that Σ_i F_i = n^3 and T = max_i T_i is minimized
  – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j) (see the sketch below)
• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so that they add up to F_i flops
• Works for Strassen, other algorithms, …
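The optimal assignment is one line of numpy once the ξ_i are formed. The per-processor parameters below are hypothetical, chosen only to illustrate the formula.

    import numpy as np

    # Hypothetical per-processor parameters (illustrative assumptions)
    gamma = np.array([1e-11, 2e-11, 5e-11])   # secs per flop
    beta  = np.array([1e-09, 1e-09, 2e-09])   # secs per word
    alpha = np.array([1e-06, 2e-06, 1e-06])   # secs per message
    M     = np.array([1e6, 1e6, 1e5])         # fast-memory sizes
    n = 4096

    xi = gamma + beta / M**0.5 + alpha / M**1.5   # effective secs per flop
    F = n**3 * (1 / xi) / np.sum(1 / xi)          # optimal flop assignment
    T = F * xi                                    # all T_i are equal...
    assert np.allclose(T, n**3 / np.sum(1 / xi))  # ...to the optimal runtime
    print(F, T[0])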

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schrödinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews

Communication Lower Bounds for Strassen-like Matmul Algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n^2/P^{2/ω}
• Extends to the rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(flops / M^{log_{mp} q − 1})
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
words_moved = Ω(M·(n/M^{1/2})^3/P)

Strassen's O(n^{lg 7}) matmul:
words_moved = Ω(M·(n/M^{1/2})^{lg 7}/P)

Strassen-like O(n^ω) matmul:
words_moved = Ω(M·(n/M^{1/2})^ω/P)

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

The best way to interleave BFS and DFS is a tuning parameter; a schematic scheduler sketch follows.
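The sketch below only prints a BFS/DFS interleaving; the memory estimate and base-case threshold are simplified assumptions, not CAPS's actual policy.

    def caps_schedule(n, P, mem, depth=0):
        """Schematic CAPS scheduler: take a BFS step (7 parallel subproblems
        on P/7 processors, ~7/4 the memory) when memory and processors allow,
        else a DFS step (7 sequential subproblems on all P processors, ~1/4
        the memory)."""
        pad = "  " * depth
        if n <= 1024 or P == 1:
            print(pad + "base: local matmul, n=%d on P=%d" % (n, P))
            return
        bfs_need = 7 / 4 * 3 * (n // 2) ** 2 / P   # rough words per processor
        if P % 7 == 0 and bfs_need <= mem:
            print(pad + "BFS: 7 parallel subproblems, n=%d on P=%d each"
                  % (n // 2, P // 7))
            caps_schedule(n // 2, P // 7, mem, depth + 1)  # one representative
        else:
            print(pad + "DFS: 7 sequential subproblems, n=%d on P=%d"
                  % (n // 2, P))
            caps_schedule(n // 2, P, mem, depth + 1)

    caps_schedule(n=2**15, P=7**3, mem=2**22)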


Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as a Research Highlight in CACM

Strassen-like Beyond Matmul

• Thm (D., Dumitriu, Holtz '07): any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^{ω+η}) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen itself is already stable, so η = 0
• Thm: for sequential versions of these algorithms, words_moved = O(n^{ω+η}/M^{(ω+η)/2 − 1} + n^2 log n), i.e. they attain the expected lower bound
  – Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors and available memory
• CARMA:
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors and available memory

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, square case m = k = n = 6144, comparing ScaLAPACK and CARMA against peak. Cray XE6 (Hopper); each node 2 x 12-core, 4 x NUMA]

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, inner-product case m = n = 192, k = 6,291,456, comparing ScaLAPACK and CARMA against peak. Cray XE6 (Hopper); each node 2 x 12-core, 4 x NUMA]

CARMA Performance: Shared Memory

[Figure: log-linear plot, square case m = k = n, comparing MKL and CARMA in single and double precision against single- and double-precision peak. Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA]

CARMA Performance: Shared Memory

[Figure: log-linear plot, inner-product case m = n = 64, comparing MKL and CARMA in single and double precision. Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: shared-memory inner product (m = n = 64, k = 524,288); CARMA incurs 97% fewer misses in one configuration and 86% fewer in another]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  words_moved = O(n^3)

• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  words_moved = O(n^3/M^{1/3})

• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  words_moved = O(n^3/M^{1/2})

• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3].
 Parallel (binary tree): local QRs give R00, R10, R20, R30; pairs combine into R01, R11; then R02.
 Sequential/streaming (flat tree): R00 folds in W1, W2, W3 to give R01, R02, R03.
 Dual core: a hybrid of the two.]

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core. A minimal sketch of the parallel tree follows.
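A minimal numpy sketch of the parallel (binary-tree) variant, run serially here; the block count p is an illustrative assumption.

    import numpy as np

    def tsqr(W, p=4):
        """R factor of a tall-skinny W via a binary reduction tree:
        local QR on each of p row blocks, then pairwise QRs of stacked
        R factors up the tree."""
        blocks = np.array_split(W, p, axis=0)
        Rs = [np.linalg.qr(b)[1] for b in blocks]      # leaf QRs, in parallel
        while len(Rs) > 1:                             # tree reduction
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(1000, 8)
    R = tsqr(W)
    # R agrees with ordinary Householder QR up to row signs:
    assert np.allclose(np.abs(R), np.abs(np.linalg.qr(W)[1]))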

Back to LU: Using a Similar Idea for TSLU as TSQR — Use a Reduction Tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4], with Wi = Pi·Li·Ui:
• Choose b pivot rows of W1, call them W1'; similarly choose W2', W3', W4'
• Factor [W1'; W2'] = P12·L12·U12 and choose b pivot rows, call them W12'; similarly W34'
• Factor [W12'; W34'] = P1234·L1234·U1234 and choose b pivot rows
• Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting)

A small sketch of the row-selection tournament follows.
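The sketch below selects candidate pivot rows with SciPy's GEPP at each tournament node; the flat final round and the helper names are illustrative assumptions (CALU proper uses a binary tree all the way up).

    import numpy as np
    from scipy.linalg import lu

    def tournament_pivot_rows(W, b, blocks=4):
        """Indices of b candidate pivot rows of the tall-skinny panel W,
        chosen by tournament pivoting over the given row blocks."""
        splits = np.array_split(np.arange(W.shape[0]), blocks)
        champs = []
        for idx in splits:
            P, L, U = lu(W[idx])         # GEPP on this block: W[idx] = P L U
            order = P.T.argmax(axis=1)   # row order chosen by partial pivoting
            champs.extend(idx[order[:b]])
        champs = np.array(champs)
        P, L, U = lu(W[champs])          # final round on stacked champions
        order = P.T.argmax(axis=1)
        return champs[order[:b]]

    W = np.random.rand(64, 4)
    print(tournament_pivot_rows(W, b=4))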

Minimizing Communication in TSLU

[Figure: the same three reduction trees as in TSQR — parallel binary tree, sequential flat tree, dual-core hybrid — with an LU factorization at each node instead of a QR]

Can choose the reduction tree dynamically to match the architecture, as before.

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as partial pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU Using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA − LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is Stability of TSLU Just a "Thm"?

• The proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6 x 6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A and compare the L factors: are they the same?
    • Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs
    • The rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20 x 20, rank-4 matrices: ||L − Lnp|| often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test the conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating-point reproducibility

2D CALU with Tournament Pivoting

[Figure: performance plot]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: performance plot]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale Predicted Speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: contour plot over log2(p) and log2(n^2/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

[Figure: performance comparison]

Other CA Algorithms for Ax=b, Least Squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1 x 1 and 2 x 2 blocks
    • Pivot search down the column and along the row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU

[Figure: banded matrix T]

      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA Algorithms for Ax=b, Least Squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, we could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

Recursive GEPP:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
• Words = O(n^3/M^{1/2})
• Messages = O(n^3/M)

SMLU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
• Words = O(n^3/M^{1/2})
• Messages = O(n^3/M^{3/2})

Other CA Algorithms for Ax=b, Least Squares (3/3)

• The need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What About Sparse Matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd–Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies are OK, 2.5D works, just over a different semiring
• Kleene's algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

A runnable sketch of this recursion follows.
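Below is a numpy sketch of Kleene's recursion over the (min, +) semiring, checked against the triple-loop Floyd–Warshall; the dense test graph is an arbitrary choice.

    import numpy as np

    def tropical(A, B):
        # min-plus "matmul": (A x B)(i,j) = min_k A(i,k) + B(k,j)
        return (A[:, :, None] + B[None, :, :]).min(axis=1)

    def update(D, A, B):
        # the slide's D = A (x) B, which also keeps the old D(i,j)
        return np.minimum(D, tropical(A, B))

    def dc_apsp(D):
        """Kleene's divide-and-conquer all-pairs shortest paths.
        D: weight matrix, np.inf where there is no edge, 0 on the diagonal."""
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D = D.copy()
        D[:m, :m] = dc_apsp(D[:m, :m])
        D[:m, m:] = update(D[:m, m:], D[:m, :m], D[:m, m:])  # D12 = D11⊗D12
        D[m:, :m] = update(D[m:, :m], D[m:, :m], D[:m, :m])  # D21 = D21⊗D11
        D[m:, m:] = update(D[m:, m:], D[m:, :m], D[:m, m:])  # D22 = D21⊗D12
        D[m:, m:] = dc_apsp(D[m:, m:])
        D[m:, :m] = update(D[m:, :m], D[m:, m:], D[m:, :m])  # D21 = D22⊗D21
        D[:m, m:] = update(D[:m, m:], D[:m, m:], D[m:, m:])  # D12 = D12⊗D22
        D[:m, :m] = update(D[:m, :m], D[:m, m:], D[m:, :m])  # D11 = D12⊗D21
        return D

    # check against the triple-loop Floyd-Warshall
    rng = np.random.default_rng(0)
    D0 = rng.uniform(1, 10, (16, 16))
    np.fill_diagonal(D0, 0)
    D = D0.copy()
    for k in range(16):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    assert np.allclose(dc_apsp(D0), D)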

Performance of 2.5D APSP Using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup]

What About Sparse Matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: words_moved = Ω(w^3/M^{1/2}), etc.
• Thm (George '73): nested dissection gives an optimal ordering for a 2D grid, 3D grid, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

What About Sparse Matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdős–Rényi: Prob(A(i,j) ≠ 0) = d/n, d << n^{1/2}, i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdős–Rényi matmul satisfies (in expectation)
    words_moved = Ω(min(dn/P^{1/2}, d^2·n/P))
  – The proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: words_moved = Ω(d^2·n/(P·M^{1/2}))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q is orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U is orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-avoiding approach:
  – A → Q·A·Q^T = B, where B = B^T is banded, of bandwidth ≈ M^{1/2}
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^{1/2} cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: animation of bulge chasing on a banded symmetric matrix. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Each sweep Q1, Q2, … (applied two-sided, with Q1^T, Q2^T, …) eliminates c columns below the d-th subdiagonal and chases the resulting (d+c) x (d+c) bulge down the band.]

Conventional vs CA-SBR:
• Conventional: touches all the data 4 times
• Communication-avoiding: touches all the data once

Speedups of Symmetric Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: spectral divide-and-conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized rank-revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns: #words and #messages, for two-level and hierarchical memory)

• BLAS-3 – all columns: [FLPR'99] [BDLST'13] [MKL etc.]
• Cholesky – words (two-level): [G'97] [AP'00] [LAPACK] [BDHS'09]; messages and hierarchy: [G'97] [AP'00] [BDHS'09]
• Sym Indefinite – [BBDDDPSTY'13]
• LU – words (two-level): [G'97] [T'97] [GDX'11] [BDLST'13]; messages (two-level): [GDX'11] [BDLST'13]; words (hierarchy): [G'97] [T'97] [BDLST'13]; messages (hierarchy): [BDLST'13]
• QR – words (two-level): [EG'98] [FW'03] [DGHL'12] [BDLST'13]; messages (two-level): [FW'03] [DGHL'12] [BDLST'13]; words (hierarchy): [EG'98] [FW'03] [BDLST'13]; messages (hierarchy): [FW'03] [BDLST'13]
• Rank-Revealing QR – [BDD'11] [DGGX'13]
• Sym Eig & SVD – words: [BDD'11] [BDK'13]; messages: [BDD'11]
• Nonsym Eig – [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^{1/2}), #messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3 – [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving factor: latency (L) n/P^{1/2}
• Cholesky – [ScaLAPACK] [T'99] [SD'11]; saving factor: L n/P^{1/2}
• Sym Indefinite – words: [BBDDDPSTY'13] [ScaLAPACK]; messages: [BBDDDPSTY'13]; saving factor: L n/P^{1/2}
• LU – words: [ScaLAPACK] [GDX'11] [T'99] [SD'11]; messages: [GDX'11] [T'99] [SD'11]; saving factor: L n/P^{1/2}
• QR – words: [ScaLAPACK] [DGHL'12] [T'99]; messages: [DGHL'12] [T'99]; saving factor: L n/P^{1/2}
• Rank-Revealing QR – [BDD'11] [DGGX'13]
• Sym Eig & SVD – words: [BDD'11] [BDK'13] [ScaLAPACK]; messages: [BDD'11] [BDK'13]; saving factor: L n/P^{1/2}
• Nonsym Eig – [BDD'11]; saving factors: bandwidth (BW) P^{1/2}, L n

Attainable with extra memory (2.5D): M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix]

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8 x 8 dense substructure: exploit this to limit #mem_refs

[Figure: zoomed spy plot showing the 8 x 8 blocks]

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile (Mflops for each r x c block size); the reference (1 x 1) and the best block size (4 x 2) are highlighted]

Register Profile: Itanium 2

[Figure: performance ranges from 190 Mflops (reference) to 1190 Mflops (best blocking)]

Register Profiles: IBM and Intel IA-64

[Figure: four register-blocking profiles (best vs reference, with best as % of peak).
 Power3 (17%): best 252 Mflops, reference 122 Mflops.
 Power4 (16%): best 820 Mflops, reference 459 Mflops.
 Itanium 1 (8%): best 247 Mflops, reference 107 Mflops.
 Itanium 2 (33%): best 1.2 Gflops, reference 190 Mflops.]

Another Example of Tuning Challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: spy plot]

Zoom in to Top Corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: zoomed spy plot]

3 x 3 Blocks Look Natural, but…

• Example: 3 x 3 blocking
  – Logical grid of 3 x 3 cells
• But would lead to lots of "fill-in"

[Figure: spy plot overlaid with a 3 x 3 grid]

Extra Work Can Improve Efficiency!

• Example: 3 x 3 blocking
  – Logical grid of 3 x 3 cells
  – Fill in explicit zeros
  – Unroll the 3 x 3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher

A block-storage sketch follows.
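The trade-off is easy to reproduce with SciPy's built-in block compressed row (BSR) format; the matrix below is synthetic and its density is an arbitrary choice.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n, bs = 3000, 3
    # A tile with two explicit zeros, mimicking nearly-dense 3x3 substructure
    tile = np.array([[1., 1., 0.],
                     [1., 1., 1.],
                     [0., 1., 1.]])
    # Place scaled copies of the tile at ~1% of the 3x3-block positions
    pattern = sp.random(n // bs, n // bs, density=0.01,
                        random_state=rng, format="csr")
    A_csr = sp.kron(pattern, tile, format="csr")
    A_csr.eliminate_zeros()                  # CSR: store true nonzeros only
    A_bsr = A_csr.tobsr(blocksize=(bs, bs))  # blocked storage: dense 3x3 tiles

    x = rng.random(n)
    assert np.allclose(A_csr @ x, A_bsr @ x)     # same SpMV result
    print("fill ratio:", A_bsr.nnz / A_csr.nnz)  # 9/7: stored values per true nonzero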

[Figure: spy plot. Source: accelerator cavity design problem (Ko via Husbands)]

100 x 100 Submatrix Along Diagonal

[Figure: zoomed spy plot]

Post-RCM Reordering

[Figure: spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before: green + red; after: green + blue]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm: classical CG; the SpMV and the dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm: CA-CG; the SpMVs are computed via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]

A compact classical CG sketch follows, annotated with where the per-iteration communication occurs.
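For reference, a textbook CG in numpy; the comments mark the operations that communicate in a parallel implementation, and the dense 1D Poisson test matrix is a stand-in for a sparse A.

    import numpy as np

    def cg(A, b, tol=1e-10, maxiter=500):
        """Textbook CG: one SpMV and two dot products (i.e., two global
        reductions, in the parallel setting) per iteration."""
        x = np.zeros_like(b)
        r = b.copy()                 # residual r = b - A x
        p = r.copy()
        rr = r @ r
        for _ in range(maxiter):
            Ap = A @ p               # SpMV: neighbor communication
            alpha = rr / (p @ Ap)    # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r           # dot product: global reduction
            if rr_new**0.5 < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D Poisson, SPD
    b = np.ones(n)
    x = cg(A, b)
    assert np.linalg.norm(A @ x - b) < 1e-8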

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG with the monomial basis, plotted down to machine precision.
 Model problem: 2D Poisson, 5-point stencil, 30 x 30 grid, cond(A) ≈ 400.
 CA-CG shows slower convergence due to roundoff and loss of accuracy due to roundoff;
 at s = 16 the monomial basis is rank deficient and the method breaks down.]

The small experiment below reproduces the basis breakdown.
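The breakdown is easy to reproduce: the condition number of the monomial basis [x, Ax, A^2x, …, A^s x] for the same model problem grows rapidly with s. The starting vector and the choices of s below are arbitrary.

    import numpy as np
    import scipy.sparse as sp

    # Same model problem: 2D Poisson, 5-point stencil, 30x30 grid
    m = 30
    T1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.identity(m), T1) + sp.kron(T1, sp.identity(m))).tocsr()

    rng = np.random.default_rng(1)
    x = rng.standard_normal(A.shape[0])
    for s in (4, 8, 16):
        V = np.empty((A.shape[0], s + 1))
        V[:, 0] = x
        for j in range(s):               # monomial basis: x, Ax, A^2 x, ...
            V[:, j + 1] = A @ V[:, j]
        print(s, np.linalg.cond(V))      # conditioning explodes as s grows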

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "Sparse Matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:   CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit:   Graph Laplacian             Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL Non-Reproducibility

[Figure: two panels. Left: absolute error for random vectors – same magnitude, opposite signs. Right: relative error for orthogonal vectors – even the sign is not reproducible.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (gives up goal 2 or 3)
  – Use (very) high precision to get the exact answer (gives up goal 2)
  – Pre-rounding technique (Nguyen, D.)

Performance results on a 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M. A small demonstration follows.
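Below is a small demonstration of the underlying problem, and of the "use high precision" approach via Python's exactly rounded math.fsum (a stand-in here, not the pre-rounding technique itself).

    import math
    import random

    rng = random.Random(0)
    data = [rng.uniform(-1, 1) * 10 ** rng.randint(0, 16) for _ in range(10**5)]

    def summed(order_seed):
        perm = data[:]
        random.Random(order_seed).shuffle(perm)
        return sum(perm)                  # plain left-to-right summation

    def fsummed(order_seed):
        perm = data[:]
        random.Random(order_seed).shuffle(perm)
        return math.fsum(perm)            # exactly rounded => order-independent

    # Floating-point addition is not associative: different summation orders
    # (as different thread counts / reduction trees produce) change the bits.
    print({summed(s) for s in range(4)})   # typically several distinct values
    print({fsummed(s) for s in range(4)})  # always exactly one value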

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                    • Implementing Communication-Avoiding Algorithms
                    • Why avoid communication
                    • Goals
                    • Outline
                    • Outline (2)
                    • Lower bound for all ldquon3-likerdquo linear algebra
                    • Lower bound for all ldquon3-likerdquo linear algebra (2)
                    • Lower bound for all ldquon3-likerdquo linear algebra (3)
                    • Limits to parallel scaling (12)
                    • Limits to parallel scaling (22)
                    • Can we attain these lower bounds
                    • Outline (3)
                    • 25D Matrix Multiplication
                    • 25D Matrix Multiplication (2)
                    • 25D Matmul on BGP 16K nodes 64K cores (2)
                    • Perfect Strong Scaling ndash in Time and Energy (12)
                    • Perfect Strong Scaling ndash in Time and Energy (22)
                    • Handling Heterogeneity
                    • Application to Tensor Contractions
                    • C(ijk) = Σm A(ijm)B(mk)
                    • Application to Tensor Contractions (2)
                    • Communication Lower Bounds for Strassen-like matmul algorithms
                    • vs
                    • Slide 26
                    • Strassen-like beyond matmul
                    • Cache and Network Oblivious Algorithms
                    • CARMA Performance Distributed Memory
                    • CARMA Performance Distributed Memory (2)
                    • CARMA Performance Shared Memory
                    • CARMA Performance Shared Memory (2)
                    • Why is CARMA Faster in Shared Memory
                    • Outline (4)
                    • One-sided Factorizations (LU QR) so far
                    • TSQR An Architecture-Dependent Algorithm
                    • Back to LU Using similar idea for TSLU as TSQR Use reduction
                    • Minimizing Communication in TSLU
                    • Making TSLU Numerically Stable
                    • Stability of LU using TSLU CALU
                    • Why is stability of TSLU just a ldquoThmrdquo
                    • Fixing TSLU
                    • 2D CALU with Tournament Pivoting
                    • 25D CALU with Tournament Pivoting (c=4 copies)
                    • Exascale Machine Parameters Source DOE Exascale Workshop
                    • Exascale predicted speedups for Gaussian Elimination 2D CA
                    • 25D vs 2D LU With and Without Pivoting
                    • Other CA algorithms for Ax=b least squares(13)
                    • Other CA algorithms for Ax=b least squares (23)
                    • Other CA algorithms for Ax=b least squares (33)
                    • Outline (5)
                    • What about sparse matrices (13)
                    • Performance of 25D APSP using Kleene
                    • What about sparse matrices (23)
                    • What about sparse matrices (33)
                    • Outline (6)
                    • Symmetric Eigenproblem and SVD
                    • Slide 58
                    • Slide 59
                    • Slide 60
                    • Slide 61
                    • Slide 62
                    • Slide 63
                    • Slide 64
                    • Slide 65
                    • Slide 66
                    • Slide 67
                    • Slide 68
                    • Conventional vs CA - SBR
                    • Speedups of Sym Band Reduction vs DSBTRD
                    • Nonsymmetric Eigenproblem
                    • Attaining the Lower bounds Sequential
                    • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                    • Outline (7)
                    • Avoiding Communication in Iterative Linear Algebra
                    • Outline (8)
                    • Example The Difficulty of Tuning SpMV
                    • Example The Difficulty of Tuning
                    • Speedups on Itanium 2 The Need for Search
                    • Register Profile Itanium 2
                    • Register Profiles IBM and Intel IA-64
                    • Another example of tuning challenges for SpMV
                    • Zoom in to top corner
                    • 3x3 blocks look natural buthellip
                    • Extra Work Can Improve Efficiency
                    • Slide 86
                    • Slide 87
                    • Slide 88
                    • Slide 89
                    • Summary of Other Performance Optimizations
                    • Optimized Sparse Kernel Interface - OSKI
                    • Outline (9)
                    • Example Classical Conjugate Gradient (CG)
                    • Example CA-Conjugate Gradient
                    • Outline (10)
                    • Slide 96
                    • Slide 97
                    • Outline (11)
                    • What is a ldquosparse matrixrdquo
                    • Outline (12)
                    • Reproducible Floating Point Computation
                    • Intel MKL non-reproducibility
                    • GoalsApproaches for Reproducibility
                    • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                    • Collaborators and Supporters
                    • Summary

                      Can we attain these lower bounds

                      bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

                      bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

                      new ways to encode answers new data structures

                      ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

                      ndash Algorithms Energy Heterogeneous Processors hellip11

                      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                      bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                      25D Matrix Multiplication

                      bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                      c

                      (Pc)12

                      (Pc)12

                      Example P = 32 c = 2

                      25D Matrix Multiplication

                      bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                      k

                      j

                      iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

                      (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

                      (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

                      25D Matmul on BGP 16K nodes 64K coresc = 16 copies

                      Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

                      12x faster

                      27x faster

                      Perfect Strong Scaling ndash in Time and Energy (12)

                      bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

                      bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

                      ndash γT βT αT = secs per flop per word_moved per message of size m

                      bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

                      ndash γE βE αE = joules for same operations

                      ndash δE = joules per word of memory used per sec

                      ndash εE = joules per sec for leakage etc

                      bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

                      Perfect Strong Scaling ndash in Time and Energy (22)

                      bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

                      bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

                      achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

                      Handling Heterogeneitybull Suppose each of P processors could differ

                      ndash γi = secflop βi = secword αi = secmessage Mi = memory

                      bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

                      12 + Fi αi Mi32 = Fi [γi + βi Mi

                      12 + αi Mi32] = Fi ξi

                      ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

                      ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

                      bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

                      bull Works for Strassen other algorithmshellip

                      Application to Tensor Contractions

                      bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

                      bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

                      bull Heavily used in electronic structure calculationsndash Ex NWChem

                      bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

                      ndash Solomonik Hammond Matthews

                      C(ijk) = Σm A(ijm)B(mk)

                      A3-fold symm

                      B2-fold symm

                      C2-fold symm

                      Application to Tensor Contractions

                      bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

                      bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

                      bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

                      bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(flops / M^(log_mp q – 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul: words_moved = Ω(M·(n/M^(1/2))³/P)
Strassen's O(n^lg7) matmul: words_moved = Ω(M·(n/M^(1/2))^lg7/P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^(1/2))^ω/P)

BFS vs DFS:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

Communication Avoiding Parallel Strassen (CAPS)

CAPS(A, B, P):
  if enough memory and P ≥ 7
    then BFS step
    else DFS step

Best way to interleave BFS and DFS is a tuning parameter (a sketch of the recursion follows below)
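A sequential Strassen sketch of the recursion CAPS schedules (assuming n is a power of 2; the base-case threshold is a hypothetical tuning knob). The comments mark where the BFS/DFS choice would apply in the parallel setting:

    import numpy as np

    def strassen(A, B, threshold=64):
        n = A.shape[0]
        if n <= threshold:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # The 7 recursive products. CAPS: with enough memory and P >= 7, run all 7
        # in parallel on P/7 processors each (BFS step); otherwise run them one
        # after another, each on all P processors (DFS step).
        M1 = strassen(A11 + A22, B11 + B22, threshold)
        M2 = strassen(A21 + A22, B11, threshold)
        M3 = strassen(A11, B12 - B22, threshold)
        M4 = strassen(A22, B21 - B11, threshold)
        M5 = strassen(A11 + A12, B22, threshold)
        M6 = strassen(A21 - A11, B11 + B12, threshold)
        M7 = strassen(A12 - A22, B21 + B22, threshold)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A, B = np.random.randn(256, 256), np.random.randn(256, 256)
    print(np.allclose(strassen(A, B), A @ B))   # True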

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 – 1) + n² log n), i.e. they attain the expected lower bound
(Ballard, D., Holtz, Schwartz)

Cache and Network Oblivious Algorithms
• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of the 3 dimensions to create two subproblems (a sketch follows below)
  – Choose BFS or DFS to adapt to #processors, available memory
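A sequential sketch of CARMA's split rule (the threshold is again a hypothetical tuning knob); in the parallel algorithm each split becomes a BFS or DFS step:

    import numpy as np

    def carma(A, B, threshold=64):
        m, k = A.shape
        _, n = B.shape
        if max(m, k, n) <= threshold:
            return A @ B
        if m >= k and m >= n:        # largest dimension is m: split rows of A
            h = m // 2
            return np.vstack([carma(A[:h], B, threshold),
                              carma(A[h:], B, threshold)])
        if n >= k:                   # largest dimension is n: split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], threshold),
                              carma(A, B[:, h:], threshold)])
        h = k // 2                   # largest dimension is k: split it and add
        return (carma(A[:, :h], B[:h], threshold) +
                carma(A[:, h:], B[h:], threshold))

    A, B = np.random.randn(64, 1024), np.random.randn(1024, 64)
    print(np.allclose(carma(A, B), A @ B))   # True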

CARMA Performance: Distributed Memory
[Plot (log-log): square case, m = k = n = 6144; CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA]

CARMA Performance: Distributed Memory
[Plot (log-log): inner-product case, m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA]

CARMA Performance: Shared Memory
[Plot (log x, linear y): square case, m = k = n; MKL vs CARMA in single and double precision, with single- and double-precision peak lines. Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]

CARMA Performance: Shared Memory
[Plot (log x, linear y): inner-product case, m = n = 64; MKL vs CARMA in single and double precision. Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot (linear): shared-memory inner product, m = n = 64, k = 524,288; CARMA incurs 97% and 86% fewer L3 misses than MKL]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  – words_moved = O(n³)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – words_moved = O(n³/M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figures: W = [W0; W1; W2; W3], a tall-skinny matrix split into four row blocks.
• Parallel: binary reduction tree – local QRs produce R00, R10, R20, R30; pairs combine into R01, R11; a final combine gives R02.
• Sequential/Streaming: flat tree – R00 is combined with W1 to give R01, then with W2 to give R02, then with W3 to give R03.
• Dual Core: a hybrid of the two trees.]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
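A minimal one-level TSQR sketch in NumPy (flat tree, four blocks): each block gets a local QR with no communication, and a single combine of the stacked R factors finishes the job. The result agrees with a direct QR of W up to row signs:

    import numpy as np

    def tsqr_R(W, nblocks=4):
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(B, mode='r') for B in blocks]   # local QRs, no communication
        return np.linalg.qr(np.vstack(Rs), mode='r')       # one reduction step

    W = np.random.randn(4000, 50)
    R = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode='r')
    print(np.allclose(np.abs(R), np.abs(R_ref)))           # True (signs may differ)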

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree, to do "Tournament Pivoting"

[Figure: W (n x b) = [W1; W2; W3; W4].
• Factor each block: W1 = P1·L1·U1, …, W4 = P4·L4·U4; choose b pivot rows of each Wi, call them Wi'.
• Stack pairs and factor again: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, giving W12' and W34'.
• Finally [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows.]

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
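A rough SciPy sketch of tournament pivoting (the helper names are made up; the "reduction tree" here is just pairwise merging of candidate sets, as in the figure):

    import numpy as np
    from scipy.linalg import lu

    def gepp_pivots(W, b):
        # indices (into W) of the first b pivot rows GEPP chooses on W
        P = lu(W)[0]                      # SciPy convention: W = P @ L @ U
        order = np.argmax(P.T, axis=1)    # row order after pivoting
        return order[:b]

    def tournament_pivots(W, b, leaves=4):
        blocks = np.array_split(np.arange(W.shape[0]), leaves)
        cand = [blk[gepp_pivots(W[blk], b)] for blk in blocks]   # local winners
        while len(cand) > 1:              # pairwise reduction tree
            merged = []
            for i in range(0, len(cand) - 1, 2):
                rows = np.concatenate([cand[i], cand[i + 1]])
                merged.append(rows[gepp_pivots(W[rows], b)])
            if len(cand) % 2:
                merged.append(cand[-1])
            cand = merged
        return cand[0]   # b pivot rows: move them to the top, then LU without pivoting

    W = np.random.randn(64, 8)
    print(tournament_pivots(W, b=8))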

Minimizing Communication in TSLU

[Figures: the same reduction trees as for TSQR, with an LU at each node – parallel (binary tree), sequential/streaming (flat tree), and dual core (hybrid).]

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix, whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA – LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment (a rough re-creation in code follows below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute ||L – Lnp||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L – Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: ||L – Lnp|| often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
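A rough NumPy/SciPy re-creation of that experiment (a sketch: the no-pivot LU is a textbook loop, and the divide-by-zero it hits on rank-deficient inputs is exactly the point):

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            L[k+1:, k] = A[k+1:, k] / A[k, k]   # 0/0 occurs for rank-deficient A
            A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])
            A[k+1:, k] = 0.0
        return L, np.triu(A)

    rng = np.random.default_rng(0)
    diffs = []
    with np.errstate(divide='ignore', invalid='ignore'):
        for _ in range(100):
            A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
            P, L, U = lu(A)                     # GEPP: A = P @ L @ U
            Lnp, _ = lu_nopivot(P.T @ A)        # no-pivot LU of the pre-pivoted matrix
            diffs.append(np.max(np.abs(L - Lnp)))
    diffs = np.array(diffs)
    print(np.isnan(diffs).sum(), np.isinf(diffs).sum(), diffs[np.isfinite(diffs)].max())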

Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting
[Figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[Figure]

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Heatmap: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far: could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1 to n
      for i = 1 to n
        for j = 1 to n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm (a runnable sketch follows below):
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
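A runnable sketch of the semiring "matmul" and Kleene's recursion (assuming n is a power of 2 and a dense distance matrix), checked against Floyd-Warshall:

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12 = D[:m, :m].copy(), D[:m, m:].copy()
        D21, D22 = D[m:, :m].copy(), D[m:, m:].copy()
        D11 = dc_apsp(D11)
        D12 = minplus(D12, D11, D12)
        D21 = minplus(D21, D21, D11)
        D22 = minplus(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = minplus(D21, D22, D21)
        D12 = minplus(D12, D12, D22)
        D11 = minplus(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    n = 8
    rng = np.random.default_rng(1)
    A = rng.uniform(1, 10, (n, n))
    np.fill_diagonal(A, 0)

    D = A.copy()                        # Floyd-Warshall reference
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])

    print(np.allclose(dc_apsp(A.copy()), D))   # True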

Performance of 2.5D APSP using Kleene
[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores), showing a 6.2x speedup and a 2x speedup]

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdős–Rényi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdős–Rényi matmul satisfies (in expectation)
    Words_moved = Ω(min(d·n/P^(1/2), d²·n/P))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ's diagonal holds eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A -> Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures, sweeps 1–6, on a banded symmetric matrix. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Each orthogonal sweep Q_i eliminates d diagonals from a block of c columns; applying Q_i^T from the other side creates a (d+c) x (d+c) bulge, which the following sweeps chase off the end of the band.]

Conventional vs CA-SBR

  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
      [ A11  A12 ]
      [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(columns of the original table: #Words and #Messages, for Two Levels of memory and for a full Memory Hierarchy)

• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.] (words and messages, both models)
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09] (words); [G'97], [AP'00], [BDHS'09] (messages, both models)
• Sym Indefinite: [BBDDDPSTY'13] (words and messages)
• LU: [G'97], [T'97], [GDX'11], [BDLST'13] (words); [GDX'11], [BDLST'13] (messages)
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13] (words); [FW'03], [DGHL'12], [BDLST'13] (messages)
• Rank Revealing QR: [BDD'11], [DGGX'13]
• Sym Eig & SVD: [BDD'11], [BDK'13] (words); [BDD'11] (messages)
• Non-Sym Eig: [BDD'11] (words and messages)

Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
(ignoring poly-log(P) factors; words = Θ(n²/P^(1/2)), messages = Θ(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3 – Words (BW): [AGZ'94], [MT'99], [ScaLAPACK]; Messages (L): [C'69], [vGW'97], [SD'11]; saving factor: L: n/P^(1/2)
• Cholesky – Words: [ScaLAPACK]; Messages: [T'99], [SD'11]; saving: L: n/P^(1/2)
• Sym Indefinite – Words: [BBDDDPSTY'13], [ScaLAPACK]; Messages: [BBDDDPSTY'13]; saving: L: n/P^(1/2)
• LU – Words: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; Messages: [GDX'11], [T'99], [SD'11]; saving: L: n/P^(1/2)
• QR – Words: [ScaLAPACK], [DGHL'12], [T'99]; Messages: [DGHL'12], [T'99]; saving: L: n/P^(1/2)
• Rank Revealing QR – [BDD'11], [DGGX'13]
• Sym Eig & SVD – Words: [BDD'11], [BDK'13], [ScaLAPACK]; Messages: [BDD'11], [BDK'13]; saving: L: n/P^(1/2)
• Non-Sym Eig – Words: [BDD'11]; Messages: [BDD'11]; saving: BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
[Figure: spy plot of the matrix]

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
[Figure: zoom of the spy plot, showing the 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search
[Register-profile plot (Mflops): marks the reference (unblocked) performance and the best register blocking found by search, 4x2]

Register Profile: Itanium 2
[Plot: SpMV performance across all register block sizes, from 190 Mflops up to 1190 Mflops]

Register Profiles: IBM and Intel IA-64
[Four panels, best fraction of machine peak per platform: Power3 – 17% (122 up to 252 Mflops), Power4 – 16% (459 up to 820 Mflops), Itanium 1 – 8% (107 up to 247 Mflops), Itanium 2 – 33% (190 Mflops up to 1.2 Gflops)]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M
[Figure: spy plot]

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M
[Figure: zoomed spy plot]

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency!
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
(a small blocked-format demonstration follows below)
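SciPy's Block Sparse Row (BSR) format makes the fill-in trade-off easy to see. This is only a sketch: the matrix here is random, so its fill ratio will be far worse than the natural block structure of a matrix like raefsky:

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(90, 90, density=0.05, format='csr', random_state=0)

    # BSR with 3x3 blocks: explicit zeros are stored so that every kept block
    # is dense, letting the inner 3x3 multiply be unrolled.
    B = A.tobsr(blocksize=(3, 3))

    print(B.nnz / A.nnz)                 # "fill ratio": stored entries / true nonzeros
    x = np.random.randn(90)
    print(np.allclose(A @ x, B @ x))     # same product, different data structure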

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot]

100x100 Submatrix Along Diagonal
[Figure: zoomed spy plot]

Post-RCM Reordering
[Figure: spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering
[Figure – Before: green + red; After: green + blue]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm figure: the SpMV and the dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm figure: the s-step basis is computed via the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication (a sketch of this structure follows below)]
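A sketch of the s-step setup (the function name is made up; for brevity the basis is built with plain SpMVs and the monomial basis, rather than a true communication-avoiding matrix powers kernel):

    import numpy as np
    import scipy.sparse as sp

    def ca_cg_setup(A, p, r, s):
        # basis [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r] plus Gram matrix G = V^T V;
        # in CA-CG, V comes from the matrix powers kernel (O(1) communication) and
        # G from one global reduction, supplying all dot products for s iterations.
        P = [p]
        for _ in range(s):
            P.append(A @ P[-1])
        R = [r]
        for _ in range(s - 1):
            R.append(A @ R[-1])
        V = np.column_stack(P + R)
        return V, V.T @ V

    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100), format='csr')
    b = np.ones(100)
    V, G = ca_cg_setup(A, b, b.copy(), s=4)
    print(V.shape, G.shape)    # (100, 9) (9, 9)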

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot: CG vs CA-CG with the monomial basis, on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). Annotations: slower convergence due to roundoff; loss of accuracy, relative to machine precision, due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
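The breakdown is easy to reproduce: the monomial basis [x, Ax, A²x, …] behaves like a Vandermonde matrix, so its condition number explodes as s grows. A sketch on the same model problem:

    import numpy as np

    g = 30                                    # 2D Poisson, 5-point stencil, 30x30 grid
    T = 2*np.eye(g) - np.eye(g, k=1) - np.eye(g, k=-1)
    A = np.kron(np.eye(g), T) + np.kron(T, np.eye(g))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(g * g)
    for s in (4, 8, 16):
        V = np.empty((g * g, s + 1))
        V[:, 0] = x
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        print(s, np.linalg.cond(V))           # blows up; nearly rank deficient by s = 16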

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                     Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):         CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):         Graph Laplacian             Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as sum of S^k and low-rank matrices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% – 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (errors of the same magnitude, opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible). Setup: vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3 or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value.]

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (sacrifices 2. or 3.)
  – Use (very) high precision to get exact answer (sacrifices 2.)
  – Prerounding technique (Nguyen, D.)
• Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
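The root cause, and the "exact answer" approach, in a few lines: floating-point addition is not associative, but a correctly rounded sum is order-independent by construction (Python's math.fsum computes it exactly):

    import math
    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.standard_normal(10**6) * 10.0**rng.integers(-8, 9, 10**6)

    s1 = np.sum(x)                                 # one summation order
    s2 = np.sum(x[::-1])                           # another order
    s3 = x.reshape(1000, 1000).sum(axis=0).sum()   # a blocked (tree-like) order
    print(s1 - s2, s1 - s3)                        # typically nonzero

    print(math.fsum(x) == math.fsum(x[::-1]))      # True: exact, hence reproducible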

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                        Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                        ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                        ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                        bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                        bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                        25D Matrix Multiplication

                        bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                        c

                        (Pc)12

                        (Pc)12

                        Example P = 32 c = 2

                        25D Matrix Multiplication

                        bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                        k

                        j

                        iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

                        (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

                        (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

                        25D Matmul on BGP 16K nodes 64K coresc = 16 copies

                        Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

                        12x faster

                        27x faster

                        Perfect Strong Scaling ndash in Time and Energy (12)

                        bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

                        bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

                        ndash γT βT αT = secs per flop per word_moved per message of size m

                        bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

                        ndash γE βE αE = joules for same operations

                        ndash δE = joules per word of memory used per sec

                        ndash εE = joules per sec for leakage etc

                        bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

                        Perfect Strong Scaling ndash in Time and Energy (22)

                        bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

                        bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

                        achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

                        Handling Heterogeneitybull Suppose each of P processors could differ

                        ndash γi = secflop βi = secword αi = secmessage Mi = memory

                        bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

                        12 + Fi αi Mi32 = Fi [γi + βi Mi

                        12 + αi Mi32] = Fi ξi

                        ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

                        ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

                        bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

                        bull Works for Strassen other algorithmshellip

                        Application to Tensor Contractions

                        bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

                        bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

                        bull Heavily used in electronic structure calculationsndash Ex NWChem

                        bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

                        ndash Solomonik Hammond Matthews

                        C(ijk) = Σm A(ijm)B(mk)

                        A3-fold symm

                        B2-fold symm

                        C2-fold symm

                        Application to Tensor Contractions

                        bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

                        bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

                        bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

                        bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

                        Communication Lower Bounds for Strassen-like matmul algorithms

                        bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

                        bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

                        ndash words_moved = Ω (flopsM^(logmpq -1))

                        bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

                        Classical O(n3) matmul

                        words_moved =Ω (M(nM12)3P)

                        Strassenrsquos O(nlg7) matmul

                        words_moved =Ω (M(nM12)lg7P)

                        Strassen-like O(nω) matmul

                        words_moved =Ω (M(nM12)ωP)

BFS vs DFS:
– BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
– DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Communication-Avoiding Parallel Strassen (CAPS)
Best way to interleave BFS and DFS is a tuning parameter
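For concreteness, a runnable sequential sketch of the seven products that CAPS schedules (assuming n is a power of two times the cutoff); the real CAPS decides per level whether these seven calls run in parallel (BFS) or one after another (DFS):

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:                      # base case: classical multiply
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # the 7 half-sized multiplies (each a BFS or DFS step in CAPS)
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C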


Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^{ω+η}) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
– η > 0 needed to deal with numerical stability
– Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^{ω+η}/M^{(ω+η)/2 − 1} + n² log n), i.e. attain expected lower bound
Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms
• Motivation: Minimize communication at every level of a hierarchical system without tuning parameters (in theory)
– Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: Divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
– Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
– Choose BFS or DFS to adapt to #processors, available memory
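A sequential sketch of CARMA's splitting rule (not the distributed implementation): always recurse on the largest of the three dimensions, so inner-product and outer-product shapes are handled as naturally as square ones.

    import numpy as np

    def carma(A, B, cutoff=64):
        m, k = A.shape
        n = B.shape[1]
        if max(m, k, n) <= cutoff:
            return A @ B
        if m >= k and m >= n:        # split rows of A
            h = m // 2
            return np.vstack([carma(A[:h], B, cutoff), carma(A[h:], B, cutoff)])
        if n >= k:                   # split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], cutoff), carma(A, B[:, h:], cutoff)])
        h = k // 2                   # split the shared (reduction) dimension
        return carma(A[:, :h], B[:h], cutoff) + carma(A[:, h:], B[h:], cutoff)

On the inner-product shape in the plots below (m = n small, k huge), every early split is along k, which is where a fixed 2D layout loses.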

CARMA Performance: Distributed Memory
[Plot (log-log): Square, m = k = n = 6144; CARMA vs ScaLAPACK vs Peak]
Cray XE6 (Hopper), each node 2 × 12 core, 4 × NUMA

CARMA Performance: Distributed Memory
[Plot (log-log): Inner Product, m = n = 192, k = 6291456; CARMA vs ScaLAPACK vs Peak]
Cray XE6 (Hopper), each node 2 × 12 core, 4 × NUMA

CARMA Performance: Shared Memory
[Plot (log x, linear y): Square, m = k = n; MKL vs CARMA, single and double precision, with single/double peak lines]
Intel Emerald: 4 Intel Xeon X7560 × 8 cores, 4 × NUMA

CARMA Performance: Shared Memory
[Plot (log x, linear y): Inner Product, m = n = 64; MKL vs CARMA, single and double precision]
Intel Emerald: 4 Intel Xeon X7560 × 8 cores, 4 × NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot (linear): Shared Memory Inner Product (m = n = 64, k = 524288); 97% fewer misses (double), 86% fewer misses (single)]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical Approach:
  for i = 1 to n
    update column i
    update trailing matrix
• words_moved = O(n³)

• Blocked Approach (LAPACK):
  for i = 1 to n/b
    update block i of b columns
    update trailing matrix
• words_moved = O(n³/M^{1/3})

• Recursive Approach:
  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• words_moved = O(n³/M^{1/2})

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting ⇒ n reductions
• Need another idea
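The recursive approach is compact enough to state exactly; a minimal in-place NumPy sketch, without pivoting (illustration only, unstable in general, which is why tournament pivoting is introduced next):

    import numpy as np

    def recursive_lu(A):
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]          # "update it": scale the single column
            return A
        h = n // 2
        recursive_lu(A[:, :h])           # factor left half
        L11 = np.tril(A[:h, :h], -1) + np.eye(h)
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])   # U12 = L11^{-1} A12
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]            # update trailing matrix
        recursive_lu(A[h:, h:])          # factor right half
        return A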

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3]
– Parallel: each Wi → Ri0 locally; combine pairs (R00, R10) → R01 and (R20, R30) → R11; combine (R01, R11) → R02
– Sequential/Streaming: R00 absorbs W1, W2, W3 one block at a time → R01 → R02 → R03
– Dual Core: a hybrid of the parallel and sequential trees]

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n×b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
– Choose b pivot rows of W1, call them W1′; likewise W2′, W3′, W4′

[W1′; W2′] = P12·L12·U12 → choose b pivot rows, call them W12′
[W3′; W4′] = P34·L34·U34 → choose b pivot rows, call them W34′

[W12′; W34′] = P1234·L1234·U1234 → choose final b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
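A small NumPy/SciPy sketch of one such tournament (binary tree, with GEPP as the local selector; the helper names are ours, not a library API):

    import numpy as np
    from scipy.linalg import lu

    def best_rows(W, b):
        # rows of W that GEPP would pick as its first b pivot rows
        p, _, _ = lu(W)                  # W = p @ l @ u, p a permutation matrix
        order = np.argmax(p, axis=0)     # column j of p has its 1 in row order[j]
        return W[order[:b]]

    def tournament_pivot(blocks, b):
        cand = [best_rows(Wi, b) for Wi in blocks]   # leaves of the tree
        while len(cand) > 1:                         # pairwise playoffs
            cand = [best_rows(np.vstack(cand[i:i + 2]), b)
                    for i in range(0, len(cand), 2)]
        return cand[0]   # final b pivot rows: move to top of W, LU w/o pivoting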

Minimizing Communication in TSLU

[Figure: same reduction trees as for TSQR, with an LU factorization at each node
– Parallel: LU on each Wi, then pairwise LU combines up a binary tree
– Sequential/Streaming: LU folds in W1, W2, W3 one at a time
– Dual Core: a hybrid of the two]

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable
• Details matter
– Going up the tree, we could do LU either on original rows of A (tournament pivoting) or computed rows of U
– Only tournament pivoting stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU
• Empirical testing
– Both random matrices and "special ones"
– Both binary tree (BCALU) and flat-tree (FCALU)
– 3 metrics: ‖PA−LU‖/‖A‖, normwise and componentwise backward errors
– See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment
– Generate 100 random 6×6, rank-3 matrices in Matlab
– [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
• Compute ‖L − Lnp‖: a few 0's, a few ∞'s, a few NaNs
• Rest mostly O(1)
– Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
– Same experiment with rank-6 matrices: ‖L − Lnp‖ usually nonzero, O(macheps)
– Same experiment with 20×20, rank-4 matrices: ‖L − Lnp‖ often O(10³)
• Much harder to break TSLU, but possible
– Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ‖L‖; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
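A sketch of that fallback logic, with hypothetical tslu/tsqr callables and assumed thresholds:

    import numpy as np

    def safe_factor(A, tslu, tsqr, cond_tol=1e8, growth_tol=1e2):
        P, L, U = tslu(A)
        if np.linalg.cond(U) < cond_tol and np.linalg.norm(L, np.inf) < growth_tol:
            return P, L, U               # usual case: TSLU was stable
        Q, R = tsqr(A)                   # rare case: factor A = Q R ...
        P, L, U = tslu(Q)                # ... then Q = P L U ...
        return P, L, U @ R               # ... so A = P L (U R)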

2D CALU with Tournament Pivoting

2.5D CALU with Tournament Pivoting (c = 4 copies)

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
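These numbers plug into the running-time model T = γ·#flops + β·#words + α·#messages; a toy evaluator (γ and the word size are our assumptions, not on the list above):

    def model_time(flops, words, messages,
                   gamma=1e-11,        # assumed sec/flop (not a slide parameter)
                   beta=8 / 100e9,     # sec per 8-byte word at 100 GB/s
                   alpha=1e-6):        # 1 microsec interconnect latency
        # alpha/beta ~ 1e4 words and alpha/gamma ~ 1e5 flops per message:
        # latency dominates unless messages are avoided
        return gamma * flops + beta * words + alpha * messages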

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot: axes log2(p) and log2(n²/p) = log2(memory_per_proc); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
– Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
• Save 1/2 the flops, preserve inertia
– Usual approach: Bunch-Kaufman
• D block diagonal with 1×1 and 2×2 blocks
• Pivot search down column, along row (lots of communication)
– Alternative: Aasen
• D = tridiagonal = T
• Two steps:
– P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU
[Figure: the banded matrix T]
– Solve/factor narrow band problem with T
• Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
– So far, could not do partial pivoting and minimize #messages, just #words
– Challenge:
• Column layout good for choosing pivots, bad for matmul
• Blocked layout good for matmul, bad for choosing pivots
– Solution: use both layouts, switching between them
• "Shape Morphing LU" or SMLU

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• Words = O(n³/M^{1/2})
• Messages = O(n³/M)

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
• Words = O(n³/M^{1/2})
• Messages = O(n³/M^{3/2})

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
– Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
– Usual approach like Partial Pivoting
• Put longest column first, update rest of matrix, repeat
• Hard to do using BLAS3 at all, let alone hit lower bound
– Use Tournament Pivoting
• Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
• Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
– Idea extends to other pivoting schemes
• Cholesky with diagonal pivoting
• LU with complete pivoting
• LDLᵀ with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
– Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21
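A runnable NumPy version of the semiring and the recursion (⊗ here folds the min with the old value, exactly as defined above):

    import numpy as np

    def minplus(A, B):
        # (min,+) "matmul": C[i,j] = min_k A[i,k] + B[k,j]
        return (A[:, :, None] + B[None, :, :]).min(axis=1)

    def otimes(D, A, B):
        # the slide's D = A (x) B: D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j))
        return np.minimum(D, minplus(A, B))

    def dc_apsp(D):
        # D: edge weights, np.inf for missing edges, 0 on the diagonal
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D = D.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = otimes(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = otimes(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = otimes(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = otimes(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = otimes(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = otimes(D[:h, :h], D[:h, h:], D[h:, :h])
        return D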

Performance of 2.5D APSP using Kleene
[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); 6.2x speedup and 2x speedup marked]

What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
– Need to do dense Cholesky on w×w submatrix
• Thm: Words_moved = Ω(w³/M^{1/2}), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
– w = n for 2D n×n grid, w = n² for 3D n×n×n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^{1/2}, i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω(min(d·n/P^{1/2}, d²·n/P))
– Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^{1/2}))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = Aᵀ (SVD similar)
– A → QᵀAQ = T, where Q orthogonal, T tridiagonal
– T → UᵀTU = Λ, where U orthogonal, Λ diagonal
– (QU)'s columns are eigenvectors, Λ eigenvalues
– Dense → Tridiagonal → Diagonal
– Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
– A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^{1/2}
– Continue as above, starting with B
– Dense → Banded → Tridiagonal → Diagonal
– Dense → Banded: use TSQR to zero out M^{1/2} cols/rows at a time
– Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
[Sequence of figures: bulge-chasing sweeps 1–6, with orthogonal updates Q1, …, Q5 applied on both sides of the band;
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

Conventional vs CA-SBR
– Conventional: touch all data 4 times
– Communication-Avoiding: touch all data once
[Animations of the two approaches]

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
– n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
– n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
– n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
– n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
– Best sequential speedup vs MKL: 1.9x
– Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
– Find orthogonal matrix Q whose leading columns span an invariant subspace of A
– QᵀAQ will be block upper triangular:
  [A11 A12]
  [ ε  A22]
– Apply recursively to A11, A22
– Depends on randomization:
1. Randomized Rank Revealing QR decomposition
2. Randomized location to try splitting spectrum
Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns of the original table: #Words and #Messages, for Two Levels of memory and for a full Memory Hierarchy)

– BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.] (all columns)
– Cholesky: [G'97] [AP'00] [LAPACK] [BDHS'09]
– Sym. Indefinite: [BBDDDPSTY'13]
– LU: [G'97] [T'97] [GDX'11] [BDLST'13]
– QR: [EG'98] [FW'03] [DGHL'12] [BDLST'13]
– Rank Revealing QR: [BDD'11] [DGGX'13]
– Sym. Eig & SVD: [BDD'11] [BDK'13]
– Non-Sym. Eig: [BDD'11]
Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^{1/2}), #messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]; last column: saving factor from attaining with extra memory (2.5D, M = Θ(c·n²/P))

– BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving: L: n/P^{1/2}
– Cholesky: [ScaLAPACK] [T'99] [SD'11]; saving: L: n/P^{1/2}
– Sym. Indefinite: [BBDDDPSTY'13] [ScaLAPACK]; saving: L: n/P^{1/2}
– LU: [ScaLAPACK] [GDX'11] [T'99] [SD'11]; saving: L: n/P^{1/2}
– QR: [ScaLAPACK] [DGHL'12] [T'99]; saving: L: n/P^{1/2}
– Rank Revealing QR: [BDD'11] [DGGX'13]
– Sym. Eig & SVD: [BDD'11] [BDK'13] [ScaLAPACK]; saving: L: n/P^{1/2}
– Non-Sym. Eig: [BDD'11]; saving: BW: P^{1/2}, L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and starting vector
– Many such "Krylov Subspace Methods"
• Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
– Assume matrix "well-partitioned"
– Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data – optimal
– Parallel implementation on p processors
• Conventional: O(k log p) messages (k SpMV calls, dot prods)
• New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
– Price: some redundant computation
– Challenges: poor partitioning, preconditioning, numerical stability

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search
[Plots: Mflops of the reference implementation vs the best blocking (4×2)]

Register Profile: Itanium 2
[Heat map over register block sizes: 190 Mflops (worst) to 1190 Mflops (best)]

Register Profiles: IBM and Intel IA-64
[Heat maps over register block sizes; best vs reference rates, and best fraction of peak:
– Power3 (17%): 252 vs 122 Mflops
– Power4 (16%): 820 vs 459 Mflops
– Itanium 1 (8%): 247 vs 107 Mflops
– Itanium 2 (33%): 1.2 Gflops vs 190 Mflops]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

3x3 blocks look natural, but…
• Example: 3x3 blocking
– Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
– Logical grid of 3x3 cells
– Fill in explicit zeros
– Unroll 3x3 block multiplies
– "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
– Actual mflop rate 1.5² = 2.25x higher
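In SciPy the same trade is a single conversion call; a toy sketch (sizes and density made up) that measures the resulting fill:

    import numpy as np
    from scipy.sparse import random as sprand

    A = sprand(3000, 3000, density=1e-3, format='csr')
    B = A.tobsr(blocksize=(3, 3))          # fill each 3x3 block with explicit zeros
    print('fill ratio:', B.nnz / A.nnz)    # stored entries grow, flops/byte improves
    x = np.ones(3000)
    assert np.allclose(A @ x, B @ x)       # same SpMV result, fewer index lookups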

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Spy plots: full matrix; 100×100 submatrix along diagonal; post-RCM reordering]

Effect of Combined RCM+TSP Reordering
– Before: Green + Red; After: Green + Blue
– 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
– Register blocking (RB): up to 4x over CSR
– Reordering to create dense structure: 2x over CSR
– Variable block splitting: 2.1x over CSR, 1.8x over RB
– Diagonals: 2x over CSR
– Symmetry: 2.8x over CSR, 2.6x over RB
– Cache blocking: 2.8x over CSR
– Multiple vectors (SpMM): 7x over CSR
– And combinations…
• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
– A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
– More general kernels later…

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
– BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
– Does both off-line and run-time tuning
– Hides complexity of run-time tuning
• For "advanced" users & solver library writers
– Available as stand-alone library
– Available as PETSc extension
– bebop.cs.berkeley.edu/oski
• pOSKI
– Extension to multicore architectures
– OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
– bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Algorithm listing: the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient
[Algorithm listing: the s SpMVs are replaced via the CA Matrix Powers Kernel, the dot products by one global reduction to compute G; local computations within the inner loop require no communication]
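A sketch of what the kernel computes for the monomial basis (numerics only; the communication-avoiding part is producing all s vectors with one round of ghost-zone communication instead of s rounds):

    import numpy as np

    def matrix_powers(A, v, s):
        # V = [v, A v, A^2 v, ..., A^s v] as columns; one call to this kernel
        # plus one Gram-matrix reduction G = V.T @ V replaces the s SpMVs and
        # ~2s dot products of s ordinary CG iterations
        V = np.empty((v.shape[0], s + 1))
        V[:, 0] = v
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        return V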

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

[Plot: convergence of CG vs CA-CG (monomial basis)
– Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400
– CA-CG shows slower convergence due to roundoff, and loss of accuracy due to roundoff
– At s = 16 the monomial basis is rank deficient! Method breaks down
– machine precision marked]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
– Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
– Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as sum of Sᵏ and low-rank matrices

Nonzero entries \ Indices | Explicit (O(nnz))  | Implicit (o(nnz))
Explicit (O(nnz))         | CSR and variations | Vision, climate, AMR, …
Implicit (o(nnz))         | Graph Laplacian    | Stencils
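The S + U·D·Vᵀ case in code: apply A without ever forming the dense n×n matrix (toy sizes):

    import numpy as np
    from scipy.sparse import random as sprand

    n, r = 1000, 5
    S = sprand(n, n, density=1e-3, format='csr')   # sparse part
    U, V = np.random.rand(n, r), np.random.rand(n, r)
    D = np.diag(np.random.rand(r))                 # small & square
    x = np.random.rand(n)
    y = S @ x + U @ (D @ (V.T @ x))   # A x in O(nnz(S) + n r) work, not O(n^2)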

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
– From Kai Diethelm, at GNS-MBH
– Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
– Willing to sacrifice 40% – 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
– Most: "What? How will I debug without reproducibility?"
– Few: "I know better, and do careful error analysis"
– S. Govindjee: needs it for fracture simulations
– S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: Absolute Error for Random Vectors (same magnitude, opposite signs); Relative Error for Orthogonal Vectors (sign not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
1. Same answer, independent of layout, #processors, order of summands
2. Good performance (scales well)
3. Portable (assume IEEE 754 only)
4. User can choose accuracy
• Approaches:
– Guarantee fixed reduction tree (fails 2 or 3)
– Use (very) high precision to get exact answer (fails 2)
– Prerounding technique (Nguyen, D.)
• Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
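The root cause is easy to demonstrate: summing the same numbers in two orders rarely gives bit-identical results, while an exactly rounded sum (the high-precision approach above) is order-independent:

    import math, random

    x = [random.uniform(-1, 1) for _ in range(10**6)]
    s1 = sum(x)              # one summation order
    s2 = sum(sorted(x))      # same data, another order
    s3 = math.fsum(x)        # exactly rounded: identical bits for any order
    print(s1 == s2, abs(s1 - s2), s3)   # s1 == s2 is typically False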

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                          25D Matrix Multiplication

                          bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                          c

                          (Pc)12

                          (Pc)12

                          Example P = 32 c = 2

                          25D Matrix Multiplication

                          bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                          k

                          j

                          iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

                          (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

                          (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

                          25D Matmul on BGP 16K nodes 64K coresc = 16 copies

                          Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

                          12x faster

                          27x faster

                          Perfect Strong Scaling ndash in Time and Energy (12)

                          bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

                          bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

                          ndash γT βT αT = secs per flop per word_moved per message of size m

                          bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

                          ndash γE βE αE = joules for same operations

                          ndash δE = joules per word of memory used per sec

                          ndash εE = joules per sec for leakage etc

                          bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

                          Perfect Strong Scaling ndash in Time and Energy (22)

                          bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

                          bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

                          achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

                          Handling Heterogeneitybull Suppose each of P processors could differ

                          ndash γi = secflop βi = secword αi = secmessage Mi = memory

                          bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

                          12 + Fi αi Mi32 = Fi [γi + βi Mi

                          12 + αi Mi32] = Fi ξi

                          ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

                          ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

                          bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

                          bull Works for Strassen other algorithmshellip

                          Application to Tensor Contractions

                          bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

                          bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

                          bull Heavily used in electronic structure calculationsndash Ex NWChem

                          bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

                          ndash Solomonik Hammond Matthews

                          C(ijk) = Σm A(ijm)B(mk)

                          A3-fold symm

                          B2-fold symm

                          C2-fold symm

                          Application to Tensor Contractions

                          bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

                          bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

                          bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

                          bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

                          Communication Lower Bounds for Strassen-like matmul algorithms

                          bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

                          bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

                          ndash words_moved = Ω (flopsM^(logmpq -1))

                          bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n³) matmul:
  words_moved = Ω(M·(n/M^(1/2))³/P)

Strassen's O(n^lg 7) matmul:
  words_moved = Ω(M·(n/M^(1/2))^lg 7/P)

Strassen-like O(n^ω) matmul:
  words_moved = Ω(M·(n/M^(1/2))^ω/P)

Communication Avoiding Parallel Strassen (CAPS)

• BFS step vs DFS step:
  – BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  – DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if enough memory and P ≥ 7, then BFS step, else DFS step
• Best way to interleave BFS and DFS is a tuning parameter
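The decision rule fits in a few lines; the sketch below is a hypothetical Python rendering (the memory accounting and names are illustrative, not the CAPS implementation):

    # CAPS scheduling rule at one recursion level: BFS uses P/7 processors
    # per subproblem but needs 7/4 as much memory; DFS fits in 1/4.
    def caps_step(P, mem_free, mem_needed):
        if P >= 7 and mem_free >= 7.0 / 4.0 * mem_needed:
            return "BFS"   # run the 7 Strassen subproblems in parallel
        return "DFS"       # run them one after another on all P processors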

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like: Beyond Matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 needed to deal with numerical stability
  – Strassen itself already stable, so η=0 there
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n² log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory
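A sequential sketch of CARMA's splitting rule (the real code additionally chooses BFS or DFS over processors; here plain recursion just shows which dimension gets split):

    import numpy as np

    # Recursive classical matmul: always split the largest of (m, k, n).
    def carma(A, B, cutoff=64):
        m, k = A.shape
        _, n = B.shape
        if max(m, k, n) <= cutoff:
            return A @ B
        if m >= k and m >= n:   # split rows of A: two independent subproblems
            return np.vstack([carma(A[:m//2], B, cutoff),
                              carma(A[m//2:], B, cutoff)])
        if n >= k:              # split columns of B: two independent subproblems
            return np.hstack([carma(A, B[:, :n//2], cutoff),
                              carma(A, B[:, n//2:], cutoff)])
        # split the shared dimension k: the two halves' results are summed
        return (carma(A[:, :k//2], B[:k//2], cutoff) +
                carma(A[:, k//2:], B[k//2:], cutoff))

Splitting the largest dimension is what lets the same code handle both the square case and the extreme "inner product" shapes benchmarked below.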

CARMA Performance: Distributed Memory
[Plot, log-log: CARMA vs ScaLAPACK vs peak. Square case: m = k = n = 6144. Cray XE6 (Hopper), each node 2 × 12-core, 4 × NUMA]

CARMA Performance: Distributed Memory
[Plot, log-log: CARMA vs ScaLAPACK vs peak. Inner product: m = n = 192, k = 6,291,456. Cray XE6 (Hopper), each node 2 × 12-core, 4 × NUMA]

CARMA Performance: Shared Memory
[Plot, log-linear: CARMA vs MKL, single and double precision, with single/double peak lines. Square case: m = k = n. Intel Emerald: 4 × Intel Xeon X7560 × 8 cores, 4 × NUMA]

CARMA Performance: Shared Memory
[Plot, log-linear: CARMA vs MKL, single and double precision. Inner product: m = n = 64. Intel Emerald: 4 × Intel Xeon X7560 × 8 cores, 4 × NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot, linear: shared-memory inner product (m = n = 64, k = 524,288); CARMA incurs 97% and 86% fewer L3 misses]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  – words_moved = O(n³)

• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – words_moved = O(n³/M^(1/3))

• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – words_moved = O(n³/M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3]

[Figure: three reduction trees for TSQR.
 Parallel (binary tree): each Wi is QR-factored to Ri0; pairs of R factors are stacked and QR-factored (R00,R10 → R01 and R20,R30 → R11); one more level combines R01,R11 → R02.
 Sequential/streaming (flat tree): QR W0 → R00, then fold in W1, W2, W3 one block at a time → R01, R02, R03.
 Dual core: a hybrid of the two trees.]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
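A NumPy sketch of the binary-tree variant, run sequentially here for illustration (the number of blocks is assumed to be a power of two; in the parallel algorithm each tree node lives on a different processor):

    import numpy as np

    def tsqr_R(blocks):
        # Leaf step: QR-factor each local block, keep only R.
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]
        # Tree reduction: stack pairs of R factors and QR-factor the stacks.
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(pair), mode='r')
                  for pair in zip(Rs[::2], Rs[1::2])]
        return Rs[0]   # the R factor of the whole tall-skinny W

    W = np.random.rand(4000, 50)
    R = tsqr_R(np.array_split(W, 4))
    # Up to signs of rows, R agrees with np.linalg.qr(W, mode='r').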

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n×b) = [W1; W2; W3; W4]
• Factor each block: Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi′
• Stack pairs and factor: [W1′; W2′] = P12·L12·U12 and [W3′; W4′] = P34·L34·U34; choose b pivot rows of each, call them W12′ and W34′
• Stack and factor: [W12′; W34′] = P1234·L1234·U1234; choose the final b pivot rows
• Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
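A sketch of one tournament, assuming SciPy is available; SciPy's partially pivoted LU plays the role of the per-node pivot selection, and the pairing loop mimics the binary reduction tree (run across processors in real TSLU; #blocks assumed a power of two here):

    import numpy as np
    from scipy.linalg import lu

    def pick_pivot_rows(block, b):
        # LU with partial pivoting: block = P @ L @ U.
        P, L, U = lu(block)
        order = np.argmax(P, axis=0)      # which source rows pivoted to the top
        return block[order[:b]]           # the b winning candidate rows

    def tournament(W, b, nblocks=4):
        cands = [pick_pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks)]
        while len(cands) > 1:             # pairwise rounds up the tree
            cands = [pick_pivot_rows(np.vstack(pair), b)
                     for pair in zip(cands[::2], cands[1::2])]
        return cands[0]                   # the final b pivot rows of W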

Minimizing Communication in TSLU

[Figure: the same three reduction trees as for TSQR – parallel binary tree, sequential/streaming flat tree, and a dual-core hybrid – with an LU factorization at each node.]

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": The new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• The proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6×6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20×20, rank-4 matrices: ||L − Lnp|| often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare):
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = PL(UR), with UR as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting

2.5D CALU with Tournament Pivoting (c = 4 copies)

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: PAPT = LDLT, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1×1 and 2×2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAPT = LTLT where T is banded, using TSLU
        [Figure: band structure of T]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

• Recursive GEPP (column layout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – Words = O(n³/M^(1/2)), Messages = O(n³/M)

• SMLU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  – Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of AP = QR span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

  (Recall that ⊗ updates the left-hand side in place, taking the min with its old value.)
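A runnable serial sketch of the min-plus update and the DC-APSP recursion above (assumes n is a power of 2, zero diagonal, and ∞ for missing edges; illustration only, not the 2.5D code):

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))  -- the "⊗" update
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return A
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D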

Performance of 2.5D APSP using Kleene
[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w × w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n×n grid, w = n² for a 3D n×n×n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; a new one?
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A=AT (SVD similar)
  – A → QTAQ = T, where Q orthogonal, T tridiagonal
  – T → UTTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ's entries eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A → QAQT = B, where B=BT banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: bulge chasing on a symmetric band matrix of bandwidth b+1.
 Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.
 Step 1: an orthogonal transform Q1 annihilates a c-column block of d diagonals below the band; applying Q1ᵀ on the other side creates a bulge of d+c diagonals further down the band.
 Steps 2, 3, 4, 5, 6, …: orthogonal transforms Q2, Q3, Q4, Q5, … chase the bulge toward the lower-right corner, c columns at a time, until the matrix has the reduced bandwidth d+1.]

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

[Animations comparing the two sweeps]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QTAQ will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3 — Two Levels and Memory Hierarchy, Words & Messages: [FLPR'99] [BDLST'13] [MKL etc.]
• Cholesky — Two Levels, Words: [G'97] [AP'00] [LAPACK] [BDHS'09]; Messages: [G'97] [AP'00] [BDHS'09]; Memory Hierarchy: [G'97] [AP'00] [BDHS'09]
• Sym Indefinite — [BBDDDPSTY'13] (both models)
• LU — Two Levels, Words: [G'97] [T'97] [GDX'11] [BDLST'13]; Messages: [GDX'11] [BDLST'13]; Memory Hierarchy, Words: [G'97] [T'97] [BDLST'13]; Messages: [BDLST'13]
• QR — Two Levels, Words: [EG'98] [FW'03] [DGHL'12] [BDLST'13]; Messages: [FW'03] [DGHL'12] [BDLST'13]; Memory Hierarchy, Words: [EG'98] [FW'03] [BDLST'13]; Messages: [FW'03] [BDLST'13]
• Rank Revealing QR — [BDD'11] [DGGX'13]
• Sym Eig & SVD — Words: [BDD'11] [BDK'13]; Messages: [BDD'11]
• Non Sym Eig — [BDD'11] (Words and Messages)

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3 — Words (BW): [AGZ'94] [MT'99] [ScaLAPACK]; Messages (L): [C'69] [vGW'97] [SD'11]; saving factor L: n/P^(1/2)
• Cholesky — [ScaLAPACK] [T'99] [SD'11]; saving factor L: n/P^(1/2)
• Sym Indefinite — Words: [BBDDDPSTY'13] [ScaLAPACK]; Messages: [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU — Words: [ScaLAPACK] [GDX'11] [T'99] [SD'11]; Messages: [GDX'11] [T'99] [SD'11]; saving factor L: n/P^(1/2)
• QR — Words: [ScaLAPACK] [DGHL'12] [T'99]; Messages: [DGHL'12] [T'99]; saving factor L: n/P^(1/2)
• Rank Revealing QR — [BDD'11] [DGGX'13]
• Sym Eig & SVD — Words: [BDD'11] [BDK'13] [ScaLAPACK]; Messages: [BDD'11] [BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym Eig — Words: [BDD'11]; Messages: [BDD'11]; saving factors BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
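The key building block is the "matrix powers kernel", which produces the Krylov basis [x, Ax, A²x, …, Aᵏx]. Its sequential specification is just k repeated SpMVs, as in the hedged sketch below; a CA implementation computes the same columns while reading A (and communicating) only O(1) times:

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        # Specification only: the conventional version makes one pass
        # over A per SpMV; a CA version produces the same basis in O(1) passes.
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])
        return np.column_stack(V)   # n x (k+1) Krylov basis

    A = sp.random(1000, 1000, density=0.01, format='csr') + sp.eye(1000)
    V = matrix_powers(A, np.ones(1000), k=4)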

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8×8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search
[Plot: register-blocking profile in Mflops; Reference point vs Best block size (4×2)]

Register Profile: Itanium 2
[Plot: performance over all block sizes, ranging from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64
[Plots of register-blocking profiles, with best fraction of peak: Power3 – 17% (122–252 Mflops), Power4 – 16% (459–820 Mflops), Itanium 1 – 8% (107–247 Mflops), Itanium 2 – 33% (190 Mflops–1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

3×3 blocks look natural, but…

• Example: 3×3 blocking
  – Logical grid of 3×3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3×3 blocking
  – Logical grid of 3×3 cells
  – Fill in explicit zeros
  – Unroll 3×3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher
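A toy register-blocked SpMV in the spirit of this slide; the block-row data layout and names here are illustrative, not OSKI's actual data structure:

    import numpy as np

    def bcsr_spmv(block_rows, y, x, r=3, c=3):
        # block_rows[I] = (cols, blocks): the block-column indices and the
        # r x c dense blocks (possibly containing filled-in zeros) of block row I.
        for I, (cols, blocks) in enumerate(block_rows):
            acc = np.zeros(r)
            for J, blk in zip(cols, blocks):
                acc += blk @ x[J*c:(J+1)*c]   # one unrollable r x c multiply
            y[I*r:(I+1)*r] += acc
        return y

    # Example: one block row with two 3x3 blocks (the second has filled zeros).
    blocks = [([0, 2], [np.arange(9.0).reshape(3, 3), np.diag([1.0, 0.0, 2.0])])]
    y = bcsr_spmv(blocks, np.zeros(3), np.arange(9.0))

The filled zeros cost extra flops (the "fill ratio"), but the dense blocks amortize index loads and enable unrolling, which is why the blocked code can still run faster.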

[Spy plots: source: Accelerator Cavity Design Problem (Ko, via Husbands); 100×100 submatrix along the diagonal; post-RCM reordering]

Effect of Combined RCM+TSP Reordering
[Plot: before = green + red, after = green + blue]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm listing: in each iteration, the SpMV and the dot products require communication]

Example: CA-Conjugate Gradient

[Algorithm listing: the s-step basis is computed via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; the local computations within the inner loop require no communication]
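To make the contrast concrete, here is classical CG with its communication points marked in comments; this is a plain textbook CG sketch, not the CA-CG code itself. A CA-CG variant replaces the k SpMVs and 2k dot products of k iterations with one matrix powers kernel call and one global reduction:

    import numpy as np

    def cg(A, b, x0, iters=50):
        x = x0.copy()
        r = b - A @ x                  # SpMV: neighbor communication
        p = r.copy()
        rr = r @ r                     # dot product: global reduction
        for _ in range(iters):
            Ap = A @ p                 # SpMV: neighbor communication
            alpha = rr / (p @ Ap)      # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r             # dot product: global reduction
            p = r + (rr_new / rr) * p  # local, no communication
            rr = rr_new
        return x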

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot: CA-CG (monomial basis) vs CG, against machine precision.
 Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400.
 CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                              Indices:
  Nonzero entries:            Explicit (O(nnz))       Implicit (o(nnz))
  Explicit (O(nnz))           CSR and variations      Vision, climate, AMR, …
  Implicit (o(nnz))           Graph Laplacian         Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]

• Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
  – Dot products are computed using 1, 2, 3 or 4 threads
  – Absolute error = maximum − minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get the exact answer (fails 2)
  – Prerounding technique (Nguyen, D.)
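A short demonstration of the underlying problem and of one (slow but portable) fix; this illustrates nonassociativity, not the prerounding algorithm itself:

    import math, random

    x = [random.uniform(-1, 1) for _ in range(10**6)]
    s1 = sum(x)            # left-to-right summation
    s2 = sum(sorted(x))    # a different order => different rounding errors
    print(s1 == s2)        # often False: the result depends on summand order

    # math.fsum returns the correctly rounded exact sum, so it is
    # independent of order -- one (sequential) way to get reproducibility.
    print(math.fsum(x) == math.fsum(sorted(x)))   # True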

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                            25D Matrix Multiplication

                            bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

                            k

                            j

                            iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

                            (1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

                            (3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

                            25D Matmul on BGP 16K nodes 64K coresc = 16 copies

                            Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

                            12x faster

                            27x faster

                            Perfect Strong Scaling ndash in Time and Energy (12)

                            bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

                            bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

                            ndash γT βT αT = secs per flop per word_moved per message of size m

                            bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

                            ndash γE βE αE = joules for same operations

                            ndash δE = joules per word of memory used per sec

                            ndash εE = joules per sec for leakage etc

                            bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)bull Perfect scaling extends to N-body Strassen hellip

                            Perfect Strong Scaling ndash in Time and Energy (22)

                            bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

                            bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

                            achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

                            Handling Heterogeneitybull Suppose each of P processors could differ

                            ndash γi = secflop βi = secword αi = secmessage Mi = memory

                            bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

                            12 + Fi αi Mi32 = Fi [γi + βi Mi

                            12 + αi Mi32] = Fi ξi

                            ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

                            ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

                            bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

                            bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
 – Communication lower bounds apply
• Complex symmetries possible
 – Ex: B(m,n,k) = B(k,m,n) = …
 – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
 – Ex: NWChem
• CTF: Cyclops Tensor Framework
 – Exploits 2.5D algorithms, symmetries
 – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
 – Communication lower bounds apply
• Complex symmetries possible
 – Ex: B(m,n,k) = B(k,m,n) = …
 – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
 – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn
• CTF: Cyclops Tensor Framework
 – Exploits 2.5D algorithms, symmetries
 – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
 – Solomonik, Hammond, Matthews
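For concreteness, the contraction at the top of this slide in NumPy; the reshape line shows why it is "just" a matmul over a merged (m,n) index, which is what makes the matmul communication lower bounds apply. The symmetry bookkeeping that CTF exploits is omitted.

    import numpy as np

    i = j = k = m = n = 6
    A = np.random.rand(i, j, m, n)
    B = np.random.rand(m, n, k)

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
    C = np.einsum('ijmn,mnk->ijk', A, B)

    # Same contraction as a matmul: flatten (i,j) and (m,n) into single indices.
    C2 = (A.reshape(i * j, m * n) @ B.reshape(m * n, k)).reshape(i, j, k)
    assert np.allclose(C, C2)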

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
 – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Extends to rectangular case: multiply (m×n)·(n×p) in q mults
 – words_moved = Ω(#flops / M^(log_mp q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul: words_moved = Ω(M·(n/M^(1/2))³/P)
Strassen's O(n^lg7) matmul: words_moved = Ω(M·(n/M^(1/2))^lg7/P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^(1/2))^ω/P)

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
 vs.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step, end if

Best way to interleave BFS and DFS is a tuning parameter; a serial sketch of the recursion follows the slide.

                            26
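A serial NumPy sketch of the Strassen recursion that CAPS traverses; in CAPS each level of this same 7-way tree is executed as a BFS step (7 subproblems on P/7 processor groups) or a DFS step (one subproblem at a time on all P processors), a choice this serial version only notes in comments.

    import numpy as np

    def strassen(A, B, leaf=64):
        # Serial Strassen. In CAPS, each recursion level runs these 7 products
        # either breadth-first (in parallel, more memory) or depth-first
        # (sequentially, less memory), chosen per level.
        n = A.shape[0]
        if n <= leaf:                        # cut over to classical matmul
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h,:h], A[:h,h:], A[h:,:h], A[h:,h:]
        B11, B12, B21, B22 = B[:h,:h], B[:h,h:], B[h:,:h], B[h:,h:]
        M1 = strassen(A11 + A22, B11 + B22, leaf)
        M2 = strassen(A21 + A22, B11, leaf)
        M3 = strassen(A11, B12 - B22, leaf)
        M4 = strassen(A22, B21 - B11, leaf)
        M5 = strassen(A11 + A12, B22, leaf)
        M6 = strassen(A21 - A11, B11 + B12, leaf)
        M7 = strassen(A12 - A22, B21 + B22, leaf)
        C = np.empty_like(A)
        C[:h,:h] = M1 + M4 - M5 + M7
        C[:h,h:] = M3 + M5
        C[h:,:h] = M2 + M4
        C[h:,h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(128, 128); B = np.random.rand(128, 128)
    assert np.allclose(strassen(A, B), A @ B)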

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
 – η > 0 needed to deal with numerical stability
 – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n² log n), i.e. attains the expected lower bound
 – Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
 – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors, available memory
• CARMA:
 – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
 – Choose BFS or DFS to adapt to #processors, available memory

CARMA Performance: Distributed Memory

Square: m = k = n = 6144
[Figure: strong-scaling plot (log-log axes) of CARMA vs ScaLAPACK vs peak]
Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Distributed Memory

Inner Product: m = n = 192, k = 6,291,456
[Figure: strong-scaling plot (log-log axes) of CARMA vs ScaLAPACK vs peak]
Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Shared Memory

Square: m = k = n
[Figure: performance vs problem size (log x-axis, linear y-axis): MKL and CARMA, single and double precision, with single- and double-precision peak lines]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

Inner Product: m = n = 64
[Figure: performance vs k (log x-axis, linear y-axis): MKL and CARMA, single and double precision]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

Shared Memory Inner Product (m = n = 64, k = 524,288)
[Figure (linear scale): 97% fewer misses; 86% fewer misses]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
 – words_moved = O(n³)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
 – words_moved = O(n³/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
 – words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3] (tall and skinny)

Parallel (binary tree): each Wi → R_i0 in parallel; then [R00; R10] → R01 and [R20; R30] → R11; finally [R01; R11] → R02.

Sequential/Streaming (flat tree): W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03.

Dual Core: a hybrid of the two trees above.

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core.
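A toy NumPy version of TSQR's parallel (binary-tree) reduction, returning only the R factor; the function name tsqr and the power-of-two P assumption are illustrative.

    import numpy as np

    def tsqr(W, P):
        # Each of P 'processors' QRs its block of the tall-skinny W, then
        # pairs of R factors are stacked and re-QR'd, one tree level per pass.
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, P)]
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(pair), mode='r')
                  for pair in zip(Rs[::2], Rs[1::2])]
        return Rs[0]

    W = np.random.rand(1 << 12, 8)
    R = tsqr(W, P=4)
    # Agrees with a direct QR up to the signs of R's rows:
    assert np.allclose(np.abs(R), np.abs(np.linalg.qr(W, mode='r')))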

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
• Factor each block: Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi'
• Stack the winners and factor: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, call them W12' and W34'
• Factor [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows
• Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

                            37
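A toy sketch of tournament pivoting on a tall panel, using SciPy's LU with partial pivoting as the "match" that picks b winning rows per group; the helper names (pick_rows, tournament_pivot) are made up and no attempt is made at the real algorithm's data layout.

    import numpy as np
    from scipy.linalg import lu

    def pick_rows(block_rows, b, W):
        # One 'match': LU with partial pivoting on the candidate rows;
        # the rows moved to the top by pivoting are the winners.
        P, L, U = lu(W[block_rows])          # W[block_rows] = P @ L @ U
        winners = P.T.argmax(axis=1)[:b]
        return block_rows[winners]

    def tournament_pivot(W, b, nblocks):
        # Binary-tree tournament: local winners play off until b rows remain.
        groups = np.array_split(np.arange(W.shape[0]), nblocks)
        cand = [pick_rows(g, b, W) for g in groups]          # leaf round
        while len(cand) > 1:                                  # reduction tree
            cand = [pick_rows(np.concatenate(pair), b, W)
                    for pair in zip(cand[::2], cand[1::2])]
        return cand[0]

    W = np.random.rand(64, 4)
    print(tournament_pivot(W, b=4, nblocks=4))   # b pivot rows for the panel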

Minimizing Communication in TSLU

W = [W1; W2; W3; W4]

Parallel (binary tree): LU each Wi in parallel; combine the chosen rows of pairs with another LU; one final LU at the root.
Sequential/Streaming (flat tree): LU of W1; fold in W2, W3, W4 one at a time, each with another LU.
Dual Core: a hybrid of the two trees.

Can choose reduction tree dynamically to match architecture, as before.

                            38

Making TSLU Numerically Stable

• Details matter
 – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
 – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

                            39

Stability of LU using TSLU: CALU

• Empirical testing
 – Both random matrices and "special ones"
 – Both binary tree (BCALU) and flat-tree (FCALU)
 – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
 – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment:
 – Generate 100 random 6x6, rank-3 matrices in Matlab
 – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
  • Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs
  • Rest mostly O(1)
 – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
 – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
 – Same experiment with 20x20, rank-4 matrices: ||L − Lnp|| often O(10³)
• Much harder to break TSLU, but possible
 – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare):
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

                            42

                            2D CALU with Tournament Pivoting

                            43

2.5D CALU with Tournament Pivoting (c=4 copies)

                            44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heatmap of predicted speedup over the plane log2(p) (horizontal) vs log2(n²/p) = log2(memory_per_proc) (vertical); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
 – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, with D "simple"
  • Saves half the flops, preserves inertia
 – Usual approach: Bunch-Kaufman
  • D block diagonal with 1x1 and 2x2 blocks
  • Pivot search down column, along row (lots of communication)
 – Alternative: Aasen
  • D = tridiagonal = T
  • Two steps:
   – P·A·Pᵀ = L·T·Lᵀ, where T is banded, using TSLU
     [Figure: the banded matrix T]
   – Solve/factor narrow band problem with T
  • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
 – So far, could not do partial pivoting and minimize #messages, just #words
 – Challenge:
  • Column layout good for choosing pivots, bad for matmul
  • Blocked layout good for matmul, bad for choosing pivots
 – Solution: use both layouts, switching between them
  • "Shape Morphing LU", or SMLU

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
 – #Words = O(n³/M^(1/2))
 – #Messages = O(n³/M)

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
 – #Words = O(n³/M^(1/2))
 – #Messages = O(n³/M^(3/2))
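To make the recursion skeleton concrete, a serial NumPy sketch of the recursive factorization (no pivoting, so only safe on e.g. diagonally dominant matrices); comments mark where SMLU's layout conversions would go.

    import numpy as np

    def factor(A):
        # Recursive right-looking LU (no pivoting): overwrite A with L\U.
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]          # one column: scale below the pivot
            return A
        h = n // 2
        factor(A[:, :h])                 # factor left half (full height)
        # -- SMLU would reshape to recursive block format here --
        L11 = np.tril(A[:h, :h], -1) + np.eye(h)
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])    # U12 = L11^{-1} A12
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]             # Schur complement
        # -- SMLU would reshape back to columnwise format here --
        factor(A[h:, h:])                # factor the trailing block
        return A

    n = 8
    A0 = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant
    A = factor(A0.copy())
    L = np.tril(A, -1) + np.eye(n); U = np.triu(A)
    assert np.allclose(L @ U, A0)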

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
 – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
 – Usual approach, like partial pivoting:
  • Put longest column first, update rest of matrix, repeat
  • Hard to do using BLAS3 at all, let alone hit lower bound
 – Use Tournament Pivoting
  • Each round of tournament selects best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
  • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
 – Idea extends to other pivoting schemes
  • Cholesky with diagonal pivoting
  • LU with complete pivoting
  • LDLᵀ with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), mink(A(i,k)+B(k,j))) by D = A⊗B
 – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
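A NumPy sketch of the above, writing the accumulate-min semiring product D = A⊗B as minplus and Kleene's recursion as dc_apsp; it assumes np.inf marks missing edges and a zero diagonal.

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the semiring 'matmul'.
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        # Kleene's algorithm: recursive blocked Floyd-Warshall (min-plus).
        n = A.shape[0]
        if n == 1:
            return A
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    INF = np.inf
    G = np.array([[0, 3, INF], [INF, 0, 1], [2, INF, 0]], dtype=float)
    print(dc_apsp(G))   # e.g. shortest 0->2 path has length 4 (via vertex 1)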

Performance of 2.5D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Figure annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
 – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
 – w = n for 2D n x n grid; w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
 – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
 Words_moved = Ω(min(d·n/P^(1/2), d²·n/P))
 – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
 – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
 – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
 – (Q·U)'s columns are eigenvectors, Λ the eigenvalues
 – Dense → Tridiagonal → Diagonal
 – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
 – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
 – Continue as above, starting with B
 – Dense → Banded → Tridiagonal → Diagonal
 – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
 – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence (sweeps 1-6): starting from a band of width b+1, each sweep Qi annihilates a parallelogram of c columns and d diagonals of the band; applying Qiᵀ from the other side creates a (d+c) x (d+c) bulge further down the band, which later sweeps Q2, Q3, … chase off the end of the matrix]

Conventional vs CA-SBR

 Conventional: touch all data 4 times
 Communication-Avoiding: touch all data once

[Animations comparing the two bulge-chasing schedules]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
 – Best sequential speedup vs MKL: 1.9x
 – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
 – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
 – QᵀAQ will be block upper triangular: [A11, A12; ε, A22]
 – Apply recursively to A11, A22
 – Depends on randomization:
  1. Randomized Rank Revealing QR decomposition
  2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (#Words, #Messages) | Memory Hierarchy (#Words, #Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor attainable with extra memory (2.5D, M = Θ(c·n²/P))

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK] | [T'99][SD'11] | L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods"
  • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
 – Parallel implementation on p processors
  • Conventional: O(k log p) messages (k SpMV calls, dot prods)
  • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability

                            75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                            77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                            78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile; reference implementation: 190 Mflops, best block size (4x2): 1190 Mflops]

79

Register Profile: Itanium 2
[Figure: Mflops over all register block sizes, from 190 Mflops (worst) to 1190 Mflops (best)]

80

Register Profiles: IBM and Intel IA-64
[Figures: Power3 - 17 (122–252 Mflops), Power4 - 16 (459–820 Mflops), Itanium 1 - 8 (107–247 Mflops), Itanium 2 - 33 (190 Mflops – 1.2 Gflops)]

                            Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                            82

                            Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                            83

3x3 blocks look natural, but…

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
 – Actual mflop rate is 1.5² = 2.25x higher

                            85
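SciPy's BSR (block sparse row) format gives a quick way to see register blocking and fill-in concretely; the density and resulting fill ratio here are illustrative, not the Pentium III example above.

    import numpy as np
    from scipy.sparse import random as sprandom

    # A moderately sparse matrix with no special block structure:
    A = sprandom(3000, 3000, density=0.002, format='csr', random_state=0)
    x = np.random.rand(3000)

    # Convert to Block Sparse Row with 3x3 blocks: explicit zeros are filled
    # in so each block can be multiplied with an unrolled dense 3x3 kernel.
    B = A.tobsr(blocksize=(3, 3))
    fill_ratio = B.data.size / A.nnz      # stored entries / true nonzeros
    print(f"fill ratio = {fill_ratio:.2f}")

    assert np.allclose(A @ x, B @ x)      # same SpMV result, different storage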

                            Source Accelerator Cavity Design Problem (Ko via Husbands)

                            86

                            100x100 Submatrix Along Diagonal


                            Post-RCM Reordering

                            88

                            Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

2x speedups on Pentium 4, Power 4, …

                            Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

                            90

                            Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski
• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

                            91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                            93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure: the SpMV and the dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm figure: the s SpMVs per outer iteration are done via the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication]
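For reference, textbook CG with the per-iteration communication points marked in comments; this is the classical method of the first slide, not the CA-CG reorganization, which batches the s SpMVs into one matrix powers call and the dot products into one Gram-matrix (G) reduction.

    import numpy as np
    from scipy.sparse import diags

    def cg(A, b, tol=1e-10, maxit=500):
        # Classical CG: per iteration, 1 SpMV (neighbor exchanges in a
        # parallel setting) + 2 dot products (global reductions).
        x = np.zeros_like(b)
        r = b.copy(); p = r.copy()
        rr = r @ r                      # dot product -> global reduction
        for _ in range(maxit):
            Ap = A @ p                  # SpMV -> neighbor communication
            alpha = rr / (p @ Ap)       # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r              # dot product -> global reduction
            if rr_new < tol**2:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 100                             # 1D Poisson test problem
    A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))    # small residual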

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                            96

[Figure: convergence of CA-CG (monomial basis) vs CG, with a machine-precision reference line
 – Slower convergence due to roundoff; loss of accuracy due to roundoff
 – At s = 16, the monomial basis is rank deficient! Method breaks down
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400]

                            97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
 Entries explicit (O(nnz))    CSR and variations          Vision, climate, AMR, …
 Entries implicit (o(nnz))    Graph Laplacian             Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                            101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm, at GNS-MBH
 – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
 – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
 – Most: "What? How will I debug without reproducibility?"
 – Few: "I know better, and do careful error analysis"
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

                            103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee fixed reduction tree (fails 2. or 3.)
 – Use (very) high precision to get exact answer (fails 2.)
 – Prerounding technique (Nguyen, D.) – see the sketch below
                            104

                            Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

                            Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

                            Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

                            bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

                            Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

                            bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

                            Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

                            Instruments NEC Nokia NVIDIA Samsung Oracle

                            bull bebopcsberkeleyedu

                            Summary

                            Donrsquot Communichellip

                            106

                            Time to redesign all linear algebra n-body hellip algorithms and software

                            (and compilers)


                              25D Matmul on BGP 16K nodes 64K coresc = 16 copies

                              Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

                              12x faster

                              27x faster

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word moved, per message of size m
• T(cP) = n³/(cP) · [γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2))] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
• E(cP) = cP · { n³/(cP) · [γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2))] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n³/(cP) · [γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2))] = T(P)/c
• E(cP) = cP · { n³/(cP) · [γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2))] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as:
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
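To make the model concrete, here is a small Python sketch that evaluates T(cP) and E(cP) and confirms the perfect-scaling predictions; all hardware constants below are illustrative placeholders, not measurements.

# Hedged sketch: evaluate the strong-scaling time/energy model above.
def time_model(n, P, M, gamma_T, beta_T, alpha_T, m):
    """T(P) = (n^3/P) * [gamma_T + beta_T/M^0.5 + alpha_T/(m*M^0.5)]"""
    return (n**3 / P) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

def energy_model(n, P, M, gamma_E, beta_E, alpha_E, m, delta_E, eps_E, T):
    """E(P) = P * [per-proc flop/word/message joules + (delta_E*M + eps_E)*T]"""
    per_proc = (n**3 / P) * (gamma_E + beta_E / M**0.5 + alpha_E / (m * M**0.5))
    return P * (per_proc + (delta_E * M + eps_E) * T)

n, M, m = 10**4, 10**8, 10**4          # problem size, words per proc, message size
P0 = 3 * n**2 // M                      # minimal processor count: P*M = 3n^2
for c in (1, 2, 4, 8):                  # add processors (and their memory M)
    P = c * P0
    T = time_model(n, P, M, 1e-11, 1e-9, 1e-6, m)
    E = energy_model(n, P, M, 1e-10, 1e-9, 1e-6, m, 1e-12, 1e-3, T)
    print(f"c={c}: T={T:.3e} s (expect T(P0)/c), E={E:.3e} J (expect constant)")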

Handling Heterogeneity

• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2)] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n³, minimizing T = max_i T_i
  – Answer: F_i = n³·(1/ξ_i)/Σ_j(1/ξ_j) and T = n³/Σ_j(1/ξ_j)
• Optimal algorithm for n×n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms, …
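A minimal Python sketch of the optimal split derived above; the per-processor parameters are made-up examples.

procs = [  # (gamma_i sec/flop, beta_i sec/word, alpha_i sec/message, M_i words)
    (1e-11, 1e-9, 1e-6, 4e9),
    (5e-11, 2e-9, 2e-6, 1e9),
    (2e-11, 1e-9, 1e-6, 2e9),
]
n = 4096
xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]
inv_sum = sum(1.0 / x for x in xi)
F = [n**3 * (1.0 / x) / inv_sum for x in xi]   # F_i flops for processor i
T = n**3 / inv_sum                              # achieved (balanced) runtime
for i, (f, x) in enumerate(zip(F, xi)):
    print(f"proc {i}: F_i = {f:.3e} flops, F_i*xi_i = {f*x:.3e} s (all equal T = {T:.3e})")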

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
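For concreteness, a small numpy sketch (dimension sizes are arbitrary examples) of the example contraction, checked against the reshape-to-matmul view that makes the matmul-style lower bounds apply.

import numpy as np

I, J, K, M, N = 8, 8, 8, 6, 6
A = np.random.rand(I, J, M, N)
B = np.random.rand(M, N, K)

# C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
C = np.einsum('ijmn,mnk->ijk', A, B)

# The contraction is an (I*J) x (M*N) by (M*N) x K matrix multiply in disguise,
# so the communication lower bounds for matmul-like computations apply.
C_ref = (A.reshape(I * J, M * N) @ B.reshape(M * N, K)).reshape(I, J, K)
assert np.allclose(C, C_ref)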

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Extends to rectangular case: multiply (m×n)·(n×p) in q mults
  – words_moved = Ω(flops / M^(log_{mp} q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:
words_moved = Ω(M·(n/M^(1/2))³/P)

vs.

Strassen's O(n^(lg 7)) matmul:
words_moved = Ω(M·(n/M^(1/2))^(lg 7)/P)

vs.

Strassen-like O(n^ω) matmul:
words_moved = Ω(M·(n/M^(1/2))^ω/P)

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Best way to interleave BFS and DFS is a tuning parameter

                              26
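A hedged, sequential sketch of the BFS/DFS decision in CAPS; enough_memory and the processor-count test below are toy stand-ins for the real runtime criteria, and a real BFS step would split the processor set 7 ways rather than recurse serially.

import numpy as np

def strassen_caps(A, B, P, mem_words, threshold=64):
    n = A.shape[0]
    if n <= threshold:
        return A @ B                      # base case: classical matmul
    h = n // 2
    A11, A12, A21, A22 = A[:h,:h], A[:h,h:], A[h:,:h], A[h:,h:]
    B11, B12, B21, B22 = B[:h,:h], B[:h,h:], B[h:,:h], B[h:,h:]
    # the 7 Strassen products
    prods = [(A11+A22, B11+B22), (A21+A22, B11), (A11, B12-B22),
             (A22, B21-B11), (A11+A12, B22), (A21-A11, B11+B12),
             (A12-A22, B21+B22)]
    enough_memory = mem_words >= 7 * (h * h)   # toy memory test
    if enough_memory and P >= 7:               # BFS step: 7-way parallel split
        Ms = [strassen_caps(X, Y, P // 7, mem_words // 7) for X, Y in prods]
    else:                                      # DFS step: sequential, all P procs
        Ms = [strassen_caps(X, Y, P, mem_words) for X, Y in prods]
    M1, M2, M3, M4, M5, M6, M7 = Ms
    C = np.empty_like(A)
    C[:h,:h] = M1 + M4 - M5 + M7
    C[:h,h:] = M3 + M5
    C[h:,:h] = M2 + M4
    C[h:,h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(strassen_caps(A, B, P=49, mem_words=10**6), A @ B)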

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms: Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n² log n), i.e. attain expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA (see the sketch below):
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory

CARMA Performance: Distributed Memory

[Strong-scaling plot, log-log axes: CARMA vs ScaLAPACK vs Peak]
Square: m = k = n = 6144
Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Distributed Memory

[Strong-scaling plot, log-log axes: CARMA vs ScaLAPACK vs Peak]
Inner Product: m = n = 192, k = 6,291,456
Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Shared Memory

[Plot, log x-axis, linear y-axis: MKL vs CARMA, single and double precision, with single- and double-precision peak lines]
Square: m = k = n
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Plot, log x-axis, linear y-axis: MKL vs CARMA, single and double precision]
Inner Product: m = n = 64
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Bar chart, linear axis] Shared Memory Inner Product (m = n = 64, k = 524288):
97% fewer misses and 86% fewer misses (vs MKL)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  words_moved = O(n³)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  words_moved = O(n³/M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting ⇒ n reductions
• Need another idea

35
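For illustration, a toy unpivoted recursive LU in the shape of the recursive approach above; real codes add pivoting, so the test matrix below is made diagonally dominant to keep the sketch safe without it.

import numpy as np

def recursive_lu(A):
    """Toy recursive LU: returns (L, U) with A = L @ U, no pivoting."""
    n = A.shape[1]
    if n == 1:                               # one column: trivial factorization
        L = A / A[0, 0]
        U = np.array([[A[0, 0]]])
        return L, U
    h = n // 2
    L1, U1 = recursive_lu(A[:, :h])          # factor left half
    # update right half: triangular solve for U12, then the Schur complement
    U12 = np.linalg.solve(L1[:h, :], A[:h, h:])
    S = A[h:, h:] - L1[h:, :] @ U12
    L2, U2 = recursive_lu(S)                 # factor right half (Schur complement)
    n_rows = A.shape[0]
    L = np.zeros((n_rows, n)); L[:, :h] = L1; L[h:, h:] = L2
    U = np.zeros((n, n)); U[:h, :h] = U1; U[:h, h:] = U12; U[h:, h:] = U2
    return L, U

A = np.random.rand(8, 8) + 8 * np.eye(8)     # diagonally dominant example
L, U = recursive_lu(A)
assert np.allclose(L @ U, A)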

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3], tall and skinny

[Figure: three reduction trees]
• Parallel (binary tree): QR each Wi → R_i0; combine pairs (R00, R10) → R01 and (R20, R30) → R11; combine (R01, R11) → R02
• Sequential/Streaming (flat tree): QR W0 → R00; fold in W1 → R01; fold in W2 → R02; fold in W3 → R03
• Dual Core: hybrid of the two trees

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
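A minimal numpy sketch of parallel TSQR on a binary reduction tree; the blocks stand in for per-processor data.

import numpy as np

def tsqr_R(blocks):
    """Return the R factor of the tall-skinny matrix [blocks[0]; blocks[1]; ...]."""
    # leaves: local QR of each block W_i -> R_i0
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]
    # internal nodes: stack pairs of R factors and QR again
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(4000, 50)
blocks = np.vsplit(W, 4)                     # W0..W3, one per "processor"
R = tsqr_R(blocks)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to the sign of each row, so compare magnitudes
assert np.allclose(np.abs(R), np.abs(R_ref))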

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n×b) = [W1; W2; W3; W4], with Wi = Pi·Li·Ui
Choose b pivot rows of each Wi, call them Wi'

[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

                              37
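A hedged sketch of the tournament-pivoting reduction, using scipy's partial-pivoted LU at each node to pick the b winning rows; the flat 4-leaf tree and block sizes are arbitrary choices.

import numpy as np
import scipy.linalg as sla

def best_b_rows(block, b):
    """Indices (into block) of the b pivot rows chosen by LU with partial pivoting."""
    # scipy's lu returns P with block = P @ L @ U; column j < b of P marks
    # which row was promoted to position j, i.e. a pivot row.
    P, L, U = sla.lu(block)
    return np.argmax(P[:, :b], axis=0)

def tournament_pivot_rows(W, b, nblocks=4):
    cand = np.arange(W.shape[0])                 # global row ids of candidates
    groups = np.array_split(cand, nblocks)
    winners = [g[best_b_rows(W[g], b)] for g in groups]   # leaf round
    while len(winners) > 1:                      # reduction rounds
        merged = []
        for i in range(0, len(winners), 2):
            g = np.concatenate(winners[i:i + 2])
            merged.append(g[best_b_rows(W[g], b)])
        winners = merged
    return winners[0]                            # b pivot rows for the whole panel

W = np.random.rand(4096, 8)
rows = tournament_pivot_rows(W, b=8)
print("pivot rows chosen by tournament:", sorted(rows))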

Minimizing Communication in TSLU

W = [W1; W2; W3; W4]

[Figure: same reduction trees as TSQR, with LU at each node]
• Parallel: local LU of each Wi, then pairwise LU of stacked pivot rows up a binary tree
• Sequential/Streaming: flat tree, folding in one block at a time
• Dual Core: hybrid of the two

Can choose reduction tree dynamically to match architecture, as before

                              38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

                              39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

40

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A, and compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
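The Matlab experiment translates directly to Python; a hedged sketch (seed and sizes arbitrary):

import numpy as np
import scipy.linalg as sla

def lu_no_pivot(A):
    """Textbook LU without pivoting (will misbehave on rank-deficient A)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    with np.errstate(divide='ignore', invalid='ignore'):
        for k in range(n - 1):
            L[k+1:, k] = A[k+1:, k] / A[k, k]
            A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L, np.triu(A)

rng = np.random.default_rng(0)
diffs = []
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
    P, L, U = sla.lu(A)                 # A = P @ L @ U (partial pivoting)
    Lnp, _ = lu_no_pivot(P.T @ A)       # unpivoted LU of the pre-permuted matrix
    diffs.append(np.linalg.norm(L - Lnp))
diffs = np.array(diffs)
finite = diffs[np.isfinite(diffs)]
print("zeros:", np.sum(diffs == 0), "non-finite:", np.sum(~np.isfinite(diffs)),
      "median of the rest:", np.median(finite[finite > 0]))  # mostly O(1)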

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

                              42

                              2D CALU with Tournament Pivoting

                              43

2.5D CALU with Tournament Pivoting (c = 4 copies)

                              44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heatmap: x-axis log2(p); y-axis log2(n²/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ, where T is banded, using TSLU
        [Figure: sketch of the banded matrix T]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M)

vs.

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))

49

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A ⊗ B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n):
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21

52
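A runnable (sequential) version of Kleene's recursion, with a min-plus "product" playing the role of the semiring update D = A ⊗ B, checked against Floyd-Warshall.

import numpy as np

def minplus(D, A, B):
    """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)) -- matmul over (min, +)."""
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def dc_apsp(A):
    n = A.shape[0]
    if n == 1:
        return np.minimum(A, 0.0)
    h = n // 2
    D = A.copy()
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

INF = np.inf
A = np.array([[0, 3, INF, 7], [8, 0, 2, INF],
              [5, INF, 0, 1], [2, INF, INF, 0]], dtype=float)
D_fw = A.copy()                          # reference: Floyd-Warshall
for k in range(4):
    D_fw = np.minimum(D_fw, D_fw[:, [k]] + D_fw[[k], :])
assert np.allclose(dc_apsp(A), D_fw)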

Performance of 2.5D APSP using Kleene

[Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

53

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid; w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A·B, both diagonal: no communication in parallel case
• Ex: A·B, both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):
  Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar):
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence: successive band reduction on a symmetric band matrix of bandwidth b+1. Each step applies an orthogonal transformation Qi (and Qiᵀ) to annihilate a block of c columns spanning d+1 diagonals; this creates a (d+c)×(d+c) bulge, which is chased down the band by Q2, Q3, …, Q5 in numbered sweeps 1–6 before the next block is eliminated.]

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

[Animations of the two bulge-chasing schedules]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
      [ A11  A12 ]
      [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(columns: Two Levels #Words / #Messages; Memory Hierarchy #Words / #Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc] in all four columns
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09]; [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] throughout
• LU: [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13]; [G'97][T'97][BDLST'13] / [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13]; [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
• Non Sym Eig: [BDD'11] / [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
(columns: #Words (BW) / #Messages (L); saving factor)

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] / [C'69][vGW'97][SD'11]; L: n/P^(1/2)
• Cholesky: [ScaLAPACK] / [T'99][SD'11]; L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] / [BBDDDPSTY'13]; L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] / [GDX'11][T'99][SD'11]; L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] / [DGHL'12][T'99]; L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] / [BDD'11][BDK'13]; L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] / [BDD'11]; BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Spy plot of the sparse matrix]

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Register-blocking profile, in Mflops: Reference vs Best (4x2)]

79

Register Profile: Itanium 2

[Heatmap of Mflops over register block sizes; range 190 Mflops to 1190 Mflops]

80

Register Profiles: IBM and Intel IA-64

[Four register-blocking heatmaps:
 Power3 (best = 17% of peak): 122–252 Mflops
 Power4 (16%): 459–820 Mflops
 Itanium 1 (8%): 107–247 Mflops
 Itanium 2 (33%): 190 Mflops – 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

[Spy plot]

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher

85
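A small scipy sketch of register blocking with explicit zero fill (BCSR); the matrix and block size below are arbitrary examples.

import numpy as np
from scipy.sparse import random as sprandom, bsr_matrix

A = sprandom(300, 300, density=0.02, format='csr', random_state=0)
r = c = 3
B = bsr_matrix(A, blocksize=(r, c))          # 3x3 register blocking, zeros filled

fill_ratio = B.nnz / A.nnz                   # B.nnz counts the filled-in zeros too
print(f"fill ratio = {fill_ratio:.2f}")      # extra flops traded for regular access

x = np.random.rand(300)
assert np.allclose(A @ x, B @ x)             # same SpMV result, blocked storage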

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Spy plot]

86

100x100 Submatrix Along Diagonal

[Spy plot]

87

Post-RCM Reordering

[Spy plot]

88

Effect of Combined RCM+TSP Reordering

[Spy plots; before: green + red, after: green + blue]
2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing; annotation: the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing; annotations: matrix powers computed via the CA Matrix Powers Kernel; one global reduction to compute the Gram matrix G; local computations within the inner loop require no communication]
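For reference, a plain CG sketch with the per-iteration communication points marked in comments; CA-CG reorganizes s iterations so the SpMVs become one matrix-powers kernel and the dot products one block reduction. The model problem below matches the 2D Poisson example on the next slides.

import numpy as np
import scipy.sparse as sp

def cg(A, b, iters=50):
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r                      # dot product -> global reduction (communication)
    for _ in range(iters):
        Ap = A @ p                  # SpMV -> neighbor communication
        alpha = rr / (p @ Ap)       # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r              # dot product -> global reduction
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# 2D Poisson, 5-point stencil, on a 30x30 grid
n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T).tocsr()
b = np.random.rand(n * n)
x = cg(A, b, iters=200)
print("residual:", np.linalg.norm(b - A @ x))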

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96–97

[Convergence plots: CA-CG (monomial basis) vs CG.
 Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.
 Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16, the monomial basis is rank deficient and the method breaks down. Dashed line: machine precision.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as sum of Sᵏ and low-rank matrices

Nonzero entries \ Indices:   Explicit (O(nnz))       Implicit (o(nnz))
Explicit (O(nnz)):           CSR and variations      Vision, climate, AMR, …
Implicit (o(nnz)):           Graph Laplacian         Stencils
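A small sketch of the "sparse + low rank" form: apply A = S + U·D·Vᵀ to a vector without ever forming A; all matrices below are random examples.

import numpy as np
import scipy.sparse as sp

n, r = 1000, 5
S = sp.random(n, n, density=0.01, format='csr', random_state=0)
U = np.random.rand(n, r)
D = np.random.rand(r, r)
V = np.random.rand(n, r)

def apply_A(x):
    # S @ x is a sparse SpMV; the low-rank part costs only O(n*r)
    return S @ x + U @ (D @ (V.T @ x))

x = np.random.rand(n)
A_dense = S.toarray() + U @ D @ V.T          # only for checking, O(n^2) storage
assert np.allclose(apply_A(x), A_dense @ x)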

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                              101

                              bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

                              ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

                              demanded by customers (construction engineers) otherwise they donrsquot believe results

                              ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

                              ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

                              Reproducible Floating Point Computation

                              Absolute Error for Random Vectors

                              Same magnitude opposite signs

                              Intel MKL non-reproducibility

                              Relative Error for Orthogonal vectors

                              Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

                              Sign notreproducible

                              103

                              bull Consider summation or dot productbull Goals

                              1 Same answer independent of layout processors order of summands

                              2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

                              bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

                              GoalsApproaches for Reproducibility

                              104

                              Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

                              Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

                              Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

                              bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

                              Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

                              bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

                              Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

                              Instruments NEC Nokia NVIDIA Samsung Oracle

                              bull bebopcsberkeleyedu

                              Summary

                              Donrsquot Communichellip

                              106

                              Time to redesign all linear algebra n-body hellip algorithms and software

                              (and compilers)

                              • Intel MKL non-reproducibility
                              • GoalsApproaches for Reproducibility
                              • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                              • Collaborators and Supporters
                              • Summary

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n³/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n³/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))] = T(P)/c
• E(cP) = cP · { n³/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as (see the sketch after this slide):
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n³ and minimizing T = maxi Ti
  – Answer: Fi = n³·(1/ξi)/Σj(1/ξj) and T = n³/Σj(1/ξj); see the sketch after this list
• Optimal Algorithm for n×n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
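A minimal sketch of the optimal work assignment above; the per-processor parameters (gamma, beta, alpha, M) are invented placeholders, and the assertion checks the defining property that every processor finishes at the same time.

    # Optimal work split: F_i = n^3 * (1/xi_i) / sum_j (1/xi_j),
    # where xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2).
    n = 1024
    procs = [  # (gamma, beta, alpha, M) for each heterogeneous processor
        (1e-9, 1e-8, 1e-6, 1e8),
        (2e-9, 2e-8, 2e-6, 5e7),
        (5e-10, 1e-8, 1e-6, 2e8),
    ]
    xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]
    inv_sum = sum(1.0 / x for x in xi)
    F = [n**3 * (1.0 / x) / inv_sum for x in xi]
    T = n**3 / inv_sum
    # Every processor finishes at the same time: F[i]*xi[i] == T for all i.
    assert all(abs(f * x - T) < 1e-6 * T for f, x in zip(F, xi))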

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

C(i,j,k) = Σm A(i,j,m)·B(m,k)

[Figure: A has 3-fold symmetry, B 2-fold symmetry, C 2-fold symmetry]

Application to Tensor Contractions (continued)

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k) (see the sketch after this slide)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Extends to rectangular case: multiply (m×n)·(n×p) in q mults
  – words_moved = Ω(flops / M^(log_mp q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:
words_moved = Ω(M·(n/M^(1/2))³/P)

Strassen's O(n^lg7) matmul:
words_moved = Ω(M·(n/M^(1/2))^lg7/P)

Strassen-like O(n^ω) matmul:
words_moved = Ω(M·(n/M^(1/2))^ω/P)

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
– BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
– DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Best way to interleave BFS and DFS is a tuning parameter (see the sketch after this slide)

26
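To make the recursion shape concrete, here is a minimal sequential Strassen sketch in Python/numpy; the comments mark where CAPS would choose BFS (all 7 multiplies at once) or DFS (one after another). This illustrates the recursion only, not the CAPS implementation.

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Strassen's recursion; n must be a power of 2 for this sketch."""
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # The 7 recursive multiplies: a BFS step would run all 7 in parallel
        # on P/7 processors each; a DFS step runs them one at a time on all
        # P processors (as this sequential code does).
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)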

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 needed to deal with numerical stability
  – Strassen already stable, so η=0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n² log n), i.e. attain expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: Minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: Divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA (see the sketch after this slide)
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory
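A minimal sequential sketch of the CARMA recursion (always split the largest of m, k, n); illustrative only, since the real CARMA also decides BFS vs DFS for parallel execution.

    import numpy as np

    def carma(A, B, cutoff=64):
        """Recursive classical matmul: split the largest dimension."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        if max(m, k, n) <= cutoff:
            return A @ B
        if m >= k and m >= n:          # split rows of A
            h = m // 2
            return np.vstack([carma(A[:h], B, cutoff),
                              carma(A[h:], B, cutoff)])
        if n >= k:                     # split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], cutoff),
                              carma(A, B[:, h:], cutoff)])
        h = k // 2                     # split the shared dimension k
        return (carma(A[:, :h], B[:h], cutoff) +
                carma(A[:, h:], B[h:], cutoff))

    A = np.random.rand(100, 300); B = np.random.rand(300, 50)
    assert np.allclose(carma(A, B), A @ B)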

CARMA Performance: Distributed Memory

[Plot (log-log): CARMA vs ScaLAPACK vs peak]
Square: m = k = n = 6144
Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA

CARMA Performance: Distributed Memory

[Plot (log-log): CARMA vs ScaLAPACK vs peak]
Inner Product: m = n = 192, k = 6,291,456
Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Plot (log x, linear y): CARMA vs MKL, single and double precision, with single/double peak lines]
Square: m = k = n
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Plot (log x, linear y): CARMA vs MKL, single and double precision]
Inner Product: m = n = 64
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Plot (linear): Shared Memory Inner Product (m = n = 64, k = 524,288)]
97% fewer misses; 86% fewer misses

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  words_moved = O(n³)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  words_moved = O(n³/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

35

TSQR: An Architecture-Dependent Algorithm

Parallel:  W = [W0; W1; W2; W3] → QR of each block gives R00, R10, R20, R30 → combine pairs: [R00; R10] → R01, [R20; R30] → R11 → combine: [R01; R11] → R02

Sequential/Streaming:  W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03

Dual Core:  a hybrid of the two trees above, e.g. each core reduces its own row-blocks, then the cores combine

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core: … (see the sketch after this slide)
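A minimal numpy sketch of the parallel TSQR reduction tree (binary tree over 4 row-blocks, R factors only); a production TSQR would also represent and apply the implicit Q, and would run the leaf QRs on different processors.

    import numpy as np

    def local_R(X):
        """R factor of a tall-skinny block (the only data communicated)."""
        return np.linalg.qr(X)[1]

    def tsqr_R(blocks):
        """Binary reduction tree over row-blocks of W; returns R of W."""
        Rs = [local_R(W) for W in blocks]          # leaf QRs, in parallel
        while len(Rs) > 1:                         # one tree level per pass
            Rs = [local_R(np.vstack(Rs[i:i + 2]))
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(4000, 50)
    R_tree = tsqr_R(np.vsplit(W, 4))               # W = [W0; W1; W2; W3]
    R_ref = np.linalg.qr(W)[1]
    # R is unique only up to the signs of its rows; compare magnitudes.
    assert np.allclose(np.abs(R_tree), np.abs(R_ref))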

Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree, to do "Tournament Pivoting"

W (n×b) = [W1; W2; W3; W4], with Wi = Pi·Li·Ui
  Choose b pivot rows of each Wi, call them Wi'
[W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34
  Choose b pivot rows of each, call them W12' and W34'
[W12'; W34'] = P1234·L1234·U1234
  Choose final b pivot rows
Go back to W and use these b pivot rows (move them to top, do LU without pivoting); see the sketch after this slide

37
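The tournament can be sketched in a few lines of Python: GEPP on each block picks b candidate pivot rows, then pairwise "playoffs" up a binary reduction tree pick the winners. Illustrative only (assumes the row count divides evenly and a reasonably well-conditioned panel); not the production CALU code.

    import numpy as np

    def gepp_pivot_rows(W):
        """Indices of the b pivot rows chosen by LU with partial pivoting
        on a tall-skinny block W (m x b)."""
        W = W.astype(float).copy()
        rows = np.arange(W.shape[0])
        b = W.shape[1]
        for j in range(b):
            p = j + int(np.argmax(np.abs(W[j:, j])))  # partial pivot, col j
            W[[j, p]] = W[[p, j]]
            rows[[j, p]] = rows[[p, j]]
            W[j+1:, j:] -= np.outer(W[j+1:, j] / W[j, j], W[j, j:])
        return rows[:b]

    def tournament_pivot_rows(W, nblocks=4):
        """Local GEPP per block, then pairwise playoffs up a binary tree."""
        step = W.shape[0] // nblocks
        cand = [off + gepp_pivot_rows(W[off:off + step])
                for off in range(0, W.shape[0], step)]
        while len(cand) > 1:                          # one tree level per pass
            cand = [pair[gepp_pivot_rows(W[pair])]
                    for pair in (np.concatenate(cand[i:i + 2])
                                 for i in range(0, len(cand), 2))]
        return cand[0]

    W = np.random.rand(64, 4)
    print(tournament_pivot_rows(W))   # indices of the 4 winning pivot rows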

Minimizing Communication in TSLU

Parallel:  W = [W1; W2; W3; W4] → LU of each block → combine pairs with LU → combine with LU (binary tree)

Sequential/Streaming:  LU of W1; combine with W2, LU; combine with W3, LU; combine with W4, LU (flat tree)

Dual Core:  a hybrid of the two trees

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or computed rows of U
  – Only tournament pivoting stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

40

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
  • Compute || L ||; if not big (usual case), proceed, else
  • Factor A = QR using TSQR, then
  • Factor Q = PLU using TSLU, then
  • A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc)]

Up to 29x

2.5D vs 2D LU: With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save 1/2 flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU
        [Figure: banded (tridiagonal) T]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  Words = O(n³/M^(1/2))
  Messages = O(n³/M)

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
  Words = O(n³/M^(1/2))
  Messages = O(n³/M^(3/2))

49

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), mink(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm (see the sketch after this slide):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11,D12],[D21,D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

52
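The recursion above is runnable as-is in numpy. A minimal sketch (dense min-plus semiring, n a power of 2; the real code is 2.5D-parallel and blocked), checked against Floyd-Warshall:

    import numpy as np

    def minplus(D, A, B):
        """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the semiring 'matmul'."""
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0)
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    n = 8
    A = np.random.rand(n, n) * 10
    np.fill_diagonal(A, 0)
    D = A.copy()
    for k in range(n):                 # reference Floyd-Warshall
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    assert np.allclose(dc_apsp(A), D)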

Performance of 2.5D APSP using Kleene

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot: 6.2x speedup; 2x speedup]

53

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation) Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A=Aᵀ (SVD similar)
  – A → Qᵀ·A·Q = T where Q orthogonal, T tridiagonal
  – T → Uᵀ·T·U = Λ where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Qᵀ = B, where B=Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c+d ≤ b

[Sequence of animation frames: orthogonal sweeps Q1, Q1ᵀ, Q2, Q2ᵀ, …, Q5, Q5ᵀ eliminate c columns and d diagonals at a time and chase the resulting (d+c)×(d+c) bulges down the band; the labels b+1, d+1, c, d+c in the figures mark block dimensions]

Conventional vs CA - SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.
Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Qᵀ·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

Columns: Two Levels (Words; Messages) and Memory Hierarchy (Words; Messages):
• BLAS-3: [FLPR'99][BDLST'13][MKL etc.], for both words and messages, at both levels
• Cholesky: Words [G'97][AP'00][LAPACK][BDHS'09]; Messages and Memory Hierarchy [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] (words and messages)
• LU: Two Levels: Words [G'97][T'97][GDX'11][BDLST'13], Messages [GDX'11][BDLST'13]; Memory Hierarchy: Words [G'97][T'97][BDLST'13], Messages [BDLST'13]
• QR: Two Levels: Words [EG'98][FW'03][DGHL'12][BDLST'13], Messages [FW'03][DGHL'12][BDLST'13]; Memory Hierarchy: Words [EG'98][FW'03][BDLST'13], Messages [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13]; [BDD'11]
• Non Sym Eig: [BDD'11] (words and messages)

Attaining the Lower bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

Columns: Words (BW), Messages (L), Saving factor:
• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; L: n/P^(1/2)
• Sym Indefinite: Words [BBDDDPSTY'13][ScaLAPACK], Messages [BBDDDPSTY'13]; L: n/P^(1/2)
• LU: Words [ScaLAPACK][GDX'11][T'99][SD'11], Messages [GDX'11][T'99][SD'11]; L: n/P^(1/2)
• QR: Words [ScaLAPACK][DGHL'12][T'99], Messages [DGHL'12][T'99]; L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: Words [BDD'11][BDK'13][ScaLAPACK], Messages [BDD'11][BDK'13]; L: n/P^(1/2)
• Non-Sym Eig: Words and Messages [BDD'11]; savings: BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = O(c·n²/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: Poor partitioning, Preconditioning, Num. Stability

75

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: matrix spy plot]

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile in Mflops; Reference point vs Best block size 4x2]

79

Register Profile: Itanium 2

[Figure: 190 Mflops (reference) up to 1190 Mflops (best)]

80

Register Profiles: IBM and Intel IA-64

[Four register-profile panels: Power3 (17%), Power4 (16%), Itanium 1 (8%), Itanium 2 (33%); Mflops ranges: 122 → 252, 459 → 820, 107 → 247, and 190 Mflops → 1.2 Gflops respectively]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill-in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
• See the sketch after this slide

85
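The fill-in trade-off is easy to reproduce with scipy's BSR format; in this sketch the matrix and block size are made up, and scipy's tobsr plays the role of the register-blocked storage.

    import scipy.sparse as sp

    # Random sparse matrix whose nonzeros don't align perfectly to 3x3 blocks.
    A = sp.random(300, 300, density=0.01, format='csr', random_state=0)

    # Convert to 3x3 Block Sparse Row: every touched block is stored densely,
    # with explicit zeros, trading extra flops for unrolled block multiplies.
    B = A.tobsr(blocksize=(3, 3))

    fill_ratio = B.data.size / A.nnz   # stored entries / true nonzeros
    print(f"fill ratio = {fill_ratio:.2f}")   # > 1: explicit zeros added

Blocking wins whenever the blocked kernel's mflop rate exceeds the unblocked rate by more than the fill ratio, which is exactly the arithmetic on the slide (1.5x time speedup at fill ratio 1.5 means a 1.5² = 2.25x higher mflop rate).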

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing: the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing: k SpMVs done via the CA Matrix Powers Kernel; one global reduction to compute G; local computations within the inner loop require no communication. A plain CG sketch with its communication points marked follows below.]
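For reference, here is plain CG in Python with the per-iteration communication points marked in comments; CA-CG restructures this loop so that s iterations share one matrix-powers computation and one global reduction. This sketch shows the classical method only (dense test matrix for simplicity).

    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        """Classical CG; comments mark what costs communication in parallel."""
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rr = r @ r                    # dot product: global reduction
        for _ in range(maxit):
            Ap = A @ p                # SpMV: neighbor communication
            alpha = rr / (p @ Ap)     # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r            # dot product: global reduction
            if rr_new < tol**2:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # 1D Poisson test problem.
    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b)
    assert np.allclose(A @ x, b, atol=1e-6)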

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                96

[Plot: convergence of CG vs CA-CG (monomial basis), with a machine-precision line]
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400
• Slower convergence due to roundoff; loss of accuracy due to roundoff
• At s = 16, monomial basis is rank deficient! Method breaks down!

                                97

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                        Nonzero entries:
  Indices:              Explicit (O(nnz))     Implicit (o(nnz))
  Explicit (O(nnz))     CSR and variations    Vision, climate, AMR, …
  Implicit (o(nnz))     Graph Laplacian       Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Ak = (S + U·D·Vᵀ)k as sum of Sk and low rank matrices

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: "Absolute Error for Random Vectors" (same magnitude, opposite signs) and "Relative Error for Orthogonal Vectors" (sign not reproducible)]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (gives up goals 2 or 3)
  – Use (very) high precision to get exact answer (gives up goal 2)
  – Prerounding technique (Nguyen, D.) – see the sketch below
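To make the prerounding idea concrete, here is a minimal single-bin sketch in Python. It is only an illustration of the technique, not the actual ReproBLAS implementation: the binning constant and the separate max pass are simplifying assumptions, and a production version uses several bins to let the user trade accuracy for speed.

    import math

    def reproducible_sum(x):
        # Pre-round every addend to a common bin boundary chosen from max|x_i|.
        # Each pre-rounded term is then a multiple of one fixed ulp, small
        # enough that all partial sums are exact -- so the result is
        # independent of summation order, at some cost in accuracy.
        m = max(abs(v) for v in x)
        if m == 0.0:
            return 0.0
        # Power-of-two boundary with log2(len(x)) bits of headroom, so no
        # partial sum of pre-rounded terms can leave the exact range.
        boundary = 2.0 ** (math.ceil(math.log2(m)) + math.ceil(math.log2(len(x))) + 1)
        total = 0.0
        for v in x:
            t = (boundary + v) - boundary   # rounds v to a multiple of ulp(boundary)
            total += t                      # all additions here are exact
        return total

    import random
    data = [random.uniform(-1, 1) for _ in range(10**5)]
    s1 = reproducible_sum(data)
    random.shuffle(data)
    assert reproducible_sum(data) == s1   # bitwise identical under reordering

In parallel, each processor would pre-round and sum its local slice; because the pre-rounded partial sums are exact, any reduction tree returns the same bits.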

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers!)


Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n^3/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) } = E(P)

• Can use these formulas to answer many questions, such as
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?

Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj) (see the sketch below)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
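A small Python sketch of the formula above; the per-processor parameter values in the example are made up for illustration:

    def heterogeneous_split(n, procs):
        # procs: list of (gamma_i, beta_i, alpha_i, M_i) per the slide.
        # F_i proportional to 1/xi_i makes every T_i equal, minimizing max_i T_i.
        xis = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]
        inv_sum = sum(1.0 / xi for xi in xis)
        F = [n**3 / (xi * inv_sum) for xi in xis]   # flops assigned to processor i
        T = n**3 / inv_sum                          # common (optimal) runtime
        return F, T

    # Example: one fast and one slow processor (hypothetical numbers)
    F, T = heterogeneous_split(n=4096, procs=[(1e-11, 1e-9, 1e-6, 2**27),
                                              (4e-11, 2e-9, 2e-6, 2**25)])
    print(F, T)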

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{mn} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply (see the reshaping sketch below)
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
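A quick illustration of why the matmul lower bounds apply: the contraction above is an ordinary matrix multiply once the indices (i,j) and (m,n) are grouped. A NumPy sketch, with arbitrary illustrative sizes:

    import numpy as np

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k) is a matmul after reshaping.
    I, J, K, M, N = 4, 5, 6, 3, 2
    A = np.random.rand(I, J, M, N)
    B = np.random.rand(M, N, K)

    C_einsum = np.einsum('ijmn,mnk->ijk', A, B)
    C_matmul = (A.reshape(I*J, M*N) @ B.reshape(M*N, K)).reshape(I, J, K)
    assert np.allclose(C_einsum, C_matmul)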

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(#flops / M^(log_mp(q) − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:          words_moved = Ω(M·(n/M^(1/2))^3 / P)
Strassen's O(n^lg7) matmul:       words_moved = Ω(M·(n/M^(1/2))^lg7 / P)
Strassen-like O(n^ω) matmul:      words_moved = Ω(M·(n/M^(1/2))^ω / P)

vs.

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

Communication Avoiding Parallel Strassen (CAPS):
  if EnoughMemory and P ≥ 7 then BFS step else DFS step

Best way to interleave BFS and DFS is a tuning parameter (sketch below).
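A sketch of the BFS/DFS decision rule, with an assumed (crude) per-processor memory model; the real CAPS does exact memory accounting and treats the interleaving as a tuning parameter:

    def caps_schedule(n, P, M, base=64):
        # At each Strassen level: BFS if there are >= 7 processors and
        # enough memory (BFS needs 7/4 the memory of DFS), else DFS.
        steps = []
        while n > base:
            mem_per_proc = 3.0 * n * n / P          # A, B, C blocks (assumption)
            if P >= 7 and 7.0 / 4.0 * mem_per_proc <= M:
                steps.append("BFS")                 # 7 subproblems on P/7 procs each
                P = max(P // 7, 1)
            else:
                steps.append("DFS")                 # 7 subproblems in sequence, all P procs
            n //= 2
        return steps

    print(caps_schedule(n=94080, P=7**4, M=2**25))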

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms,
  Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n),
  i.e. attain expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems (see the sketch below)
  – Choose BFS or DFS to adapt to #processors, available memory
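A sequential toy version of CARMA's splitting rule in NumPy; the parallel algorithm additionally chooses BFS or DFS when mapping the two subproblems onto processors and memory:

    import numpy as np

    def carma(A, B, base=64):
        # Recursive rule: always split the largest of (m, k, n) in half.
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        if max(m, k, n) <= base:
            return A @ B
        if m >= k and m >= n:          # split rows of A (and of C)
            return np.vstack([carma(A[:m//2], B, base), carma(A[m//2:], B, base)])
        if n >= k:                     # split columns of B (and of C)
            return np.hstack([carma(A, B[:, :n//2], base), carma(A, B[:, n//2:], base)])
        # split the shared dimension k: the two partial products add
        return carma(A[:, :k//2], B[:k//2], base) + carma(A[:, k//2:], B[k//2:], base)

    A, B = np.random.rand(192, 4096), np.random.rand(4096, 192)
    assert np.allclose(carma(A, B), A @ B)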

CARMA Performance: Distributed Memory

[Plot (log-log): performance of CARMA vs ScaLAPACK vs machine peak. Square case: m = k = n = 6144]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Distributed Memory

[Plot (log-log): performance of CARMA vs ScaLAPACK vs machine peak. Inner product case: m = n = 192, k = 6291456]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Shared Memory

[Plot (log-linear): square case, m = k = n; MKL and CARMA in single and double precision, against single- and double-precision peak]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Plot (log-linear): inner product case, m = n = 64; MKL and CARMA in single and double precision]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Bar chart (linear scale): Shared Memory Inner Product (m = n = 64, k = 524288); CARMA incurs 86%–97% fewer L3 misses than MKL]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  words_moved = O(n^3)

• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  words_moved = O(n^3 / M^(1/3))

• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  words_moved = O(n^3 / M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3].
 Parallel: binary tree – local QRs give R00, R10, R20, R30; pairwise combines give R01, R11; final combine gives R02.
 Sequential/Streaming: flat tree – R00 folds in W1, W2, W3 to give R01, R02, R03.
 Dual Core: hybrid of the two trees.]

Can choose reduction tree dynamically: Multicore? Multisocket? Multirack? Multisite? Out-of-core? A code sketch follows.
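A compact NumPy sketch of TSQR with a binary reduction tree. For brevity it returns only R; the implicit Q is the tree of local Q factors, omitted here. It assumes the number of blocks is a power of two:

    import numpy as np

    def tsqr(W, nblocks=4):
        # QR each row block, then repeatedly stack pairs of R factors
        # and QR again -- a binary reduction tree.
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(pair), mode='r')
                  for pair in zip(Rs[::2], Rs[1::2])]
        return Rs[0]

    W = np.random.rand(1000, 8)
    R = tsqr(W)
    # R agrees with a direct QR of W up to signs of its rows:
    assert np.allclose(np.abs(R), np.abs(np.linalg.qr(W, mode='r')))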

Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
Round 1: factor each block Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi'
Round 2: factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows, W12' and W34'
Round 3: factor [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting). A code sketch follows.
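A Python sketch of tournament pivoting, using SciPy's LU for the local GEPP factorizations. 'choose_pivot_rows' returns the b rows GEPP would select, and winners play winners up a binary tree (block count assumed a power of two):

    import numpy as np
    from scipy.linalg import lu

    def choose_pivot_rows(W, b):
        # The b rows that GEPP on W pivots to the top. lu returns W = P @ L @ U.
        P, L, U = lu(W)
        return np.argmax(P, axis=0)[:b]   # row used at elimination step i

    def tournament_pivot(W, b, nblocks=4):
        groups = [idx[choose_pivot_rows(W[idx], b)]
                  for idx in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(groups) > 1:            # play pairs of candidate sets
            merged = [np.concatenate(pair) for pair in zip(groups[::2], groups[1::2])]
            groups = [idx[choose_pivot_rows(W[idx], b)] for idx in merged]
        return groups[0]   # b winning rows: move to top, then LU without pivoting

    W = np.random.rand(64, 4)
    print(tournament_pivot(W, b=4))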

Minimizing Communication in TSLU

[Figure: the same reduction-tree shapes as TSQR, with a local LU at each tree node – Parallel (binary tree), Sequential/Streaming (flat tree), Dual Core (hybrid).]

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: get same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA − LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L − Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting

[Figure: 2D CALU layout]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: 2.5D CALU layout with c = 4 replicas]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p); y-axis log2(n^2/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU: With and Without Pivoting

[Plot: 2.5D vs 2D LU performance, with and without pivoting]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Diagram: banded matrix T]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

Recursive GEPP (columnwise layout throughout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

Shape Morphing LU (switches layouts):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (runnable sketch below):
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
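A runnable NumPy sketch of the two pieces above: the min-plus update D = A⊗B (which also keeps the current D, per the definition on the slide) and Kleene's recursion. It assumes n is a power of two and a zero diagonal (no self-loop cost):

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]                    # assumes n is a power of 2
        if n == 1:
            return D
        D = D.copy()
        h = n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    # Check against Floyd-Warshall on a random complete digraph
    n = 8
    A = np.random.default_rng(0).uniform(1, 10, (n, n))
    np.fill_diagonal(A, 0)
    FW = A.copy()
    for k in range(n):
        FW = np.minimum(FW, FW[:, [k]] + FW[[k], :])
    assert np.allclose(dc_apsp(A), FW)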

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores); annotations: 6.2x speedup and 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid; w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

                                  ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

                                  b+1

                                  b+1

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  b+1

                                  b+1

                                  d+1

                                  c

                                  Successive Band Reduction (BischofLangSun)

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  1Q1

                                  b+1

                                  b+1

                                  d+1

                                  c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  12

                                  Q1

                                  b+1

                                  b+1

                                  d+1

                                  d+c

                                  d+c

                                  c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  12

                                  Q1

                                  Q1T

                                  b+1

                                  b+1

                                  d+1

                                  d+1

                                  cd+c

                                  d+c

                                  c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  1

                                  2

                                  2Q1

                                  Q1T

                                  b+1

                                  b+1

                                  d+1

                                  d+1

                                  cd+c

                                  d+c

                                  d+c

                                  d+c

                                  c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  1

                                  2

                                  2

                                  3

                                  3

                                  Q1

                                  Q1T

                                  Q2

                                  Q2T

                                  b+1

                                  b+1

                                  d+1

                                  d+1

                                  d+c

                                  d+c

                                  d+c

                                  d+c

                                  c

                                  c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  1

                                  2

                                  2

                                  3

                                  3

                                  4

                                  4

                                  Q1

                                  Q1T

                                  Q2

                                  Q2T

                                  Q3

                                  Q3T

                                  b+1

                                  b+1

                                  d+1

                                  d+1

                                  d+c

                                  d+c

                                  d+c

                                  d+c

                                  c

                                  c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  1

                                  2

                                  2

                                  3

                                  3

                                  4

                                  4

                                  5

                                  5

                                  Q1

                                  Q1T

                                  Q2

                                  Q2T

                                  Q3

                                  Q3T

                                  Q4

                                  Q4T

                                  b+1

                                  b+1

                                  d+1

                                  d+1

                                  c

                                  c

                                  d+c

                                  d+c

                                  d+c

                                  d+c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  1

                                  2

                                  2

                                  3

                                  3

                                  4

                                  4

                                  5

                                  5

                                  Q5T

                                  Q1

                                  Q1T

                                  Q2

                                  Q2T

                                  Q3

                                  Q3T

                                  Q5

                                  Q4

                                  Q4T

                                  b+1

                                  b+1

                                  d+1

                                  d+1

                                  c

                                  c

                                  d+c

                                  d+c

                                  d+c

                                  d+c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

                                  1

                                  1

                                  2

                                  2

                                  3

                                  3

                                  4

                                  4

                                  5

                                  5

                                  6

                                  6

                                  Q5T

                                  Q1

                                  Q1T

                                  Q2

                                  Q2T

                                  Q3

                                  Q3T

                                  Q5

                                  Q4

                                  Q4T

                                  b+1

                                  b+1

                                  d+1

                                  d+1

                                  c

                                  c

                                  d+c

                                  d+c

                                  d+c

                                  d+c

                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                  Successive Band Reduction (BischofLangSun)

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

[Animations: conventional vs communication-avoiding bulge chasing]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Flattened from a table whose columns are #Words and #Messages, for a two-level memory and for a full memory hierarchy; citations per algorithm:)

BLAS-3:            [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky:          [G'97] [AP'00] [LAPACK] [BDHS'09]
Sym. Indefinite:   [BBDDDPSTY'13]
LU:                [G'97] [T'97] [GDX'11] [BDLST'13]
QR:                [EG'98] [FW'03] [DGHL'12] [BDLST'13]
Rank Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD:    [BDD'11] [BDK'13]
Non-Sym. Eig:      [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Flattened from a table whose columns are #Words (BW), #Messages (L), and the saving factor attainable with extra memory, 2.5D: M = O(c·n^2/P); citations and saving factors per algorithm:)

BLAS-3:            [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving: L: n/P^(1/2)
Cholesky:          [ScaLAPACK] [T'99] [SD'11]; saving: L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13] [ScaLAPACK]; saving: L: n/P^(1/2)
LU:                [ScaLAPACK] [GDX'11] [T'99] [SD'11]; saving: L: n/P^(1/2)
QR:                [ScaLAPACK] [DGHL'12] [T'99]; saving: L: n/P^(1/2)
Rank Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD:    [BDD'11] [BDK'13] [ScaLAPACK]; saving: L: n/P^(1/2)
Non-Sym. Eig:      [BDD'11]; saving: BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the matrix powers sketch below)
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
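To make the "O(1) vs O(k) data movement" claim concrete, here is the matrix powers idea for the simplest case: a 1D 3-point stencil (tridiagonal A). Fetch a block plus k ghost points once, then compute all k SpMVs locally. A NumPy sketch under those assumptions:

    import numpy as np

    def local_matrix_powers(x_with_ghosts, k):
        # x_with_ghosts: this processor's block of x plus k ghost points
        # on each side (one exchange, instead of one exchange per SpMV).
        # A is the 3-point stencil y[i] = x[i-1] - 2*x[i] + x[i+1].
        vecs = [x_with_ghosts]
        for _ in range(k):
            v = vecs[-1]
            vecs.append(v[:-2] - 2 * v[1:-1] + v[2:])  # sheds one point per side
        return vecs   # vecs[j] holds (A^j x) on a shrinking window

    block = np.random.rand(100 + 2 * 4)     # 100 owned points + k=4 ghosts per side
    basis = local_matrix_powers(block, k=4)
    print([v.size for v in basis])          # 108, 106, 104, 102, 100

The redundant computation mentioned above is visible here: interior points near the block boundary get recomputed by both neighbors.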

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200, nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)
[Spy plot of the matrix]

• Zooming in: 8x8 dense substructure – exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Heatmap of SpMV Mflops over register block sizes r x c: Reference (1x1) vs Best (4x2)]

Register Profile: Itanium 2
[Heatmap: from 190 Mflops (worst block size) to 1190 Mflops (best)]

Register Profiles: IBM and Intel IA-64
[Heatmaps – fraction of peak and Mflops range per machine:
 Power3 (17%): 122–252 Mflops; Power4 (16%): 459–820 Mflops;
 Itanium 1 (8%): 107–247 Mflops; Itanium 2 (33%): 190 Mflops–1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1M
[Spy plot]

Zoom in to top corner
[Zoomed spy plot of the same matrix]

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (see the BSR sketch below)
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher
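SciPy's BSR format implements exactly this register-blocking trade-off: it fills in explicit zeros to get dense r x c blocks. A small sketch computing the fill ratio (the matrix and sizes are arbitrary illustration values):

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(3000, 3000, density=0.002, format="csr", random_state=0)
    B = A.tobsr(blocksize=(3, 3))       # fill in explicit zeros to make 3x3 blocks

    fill_ratio = B.data.size / A.nnz    # stored entries (incl. explicit zeros) / true nnz
    print("fill ratio:", fill_ratio)

    x = np.ones(3000)
    assert np.allclose(A @ x, B @ x)    # same SpMV result, fewer index accesses per flop

Whether the blocking pays off depends on whether the saved index traffic outweighs the extra flops on explicit zeros; that is exactly what autotuning searches for.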

[Spy plot. Source: Accelerator Cavity Design Problem (Ko via Husbands)]

100x100 Submatrix Along Diagonal
[Zoomed spy plot]

Post-RCM Reordering
[Spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering
[Spy plots. Before: Green + Red; After: Green + Blue]
2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                  93

Example: Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in each iteration

94

Example: CA-Conjugate Gradient
Krylov basis computed via CA Matrix Powers Kernel; one global reduction to compute G
Local computations within inner loop require no communication
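To make the restructuring concrete, here is a minimal runnable sketch of CA-CG with the monomial basis, following the structure described on these slides (after Carson/Demmel); the names, fixed iteration counts, and absence of preconditioning are my simplifications, not the production code. Per outer iteration: ONE matrix-powers kernel builds 2s+1 basis vectors, ONE reduction forms the Gram matrix G = Vᵀ V, and the s inner CG steps then manipulate only short coefficient vectors.

import numpy as np

def ca_cg(A, b, s=4, outer=50):
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    k = 2 * s + 1
    # basis-shift matrix B: A @ (V @ c) == V @ (B @ c) for the
    # coefficient vectors that arise during the s inner steps
    B = np.zeros((k, k))
    for i in range(s):          # shift within the p-block
        B[i + 1, i] = 1.0
    for i in range(s - 1):      # shift within the r-block
        B[s + 2 + i, s + 1 + i] = 1.0
    for _ in range(outer):
        # matrix powers kernel: [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]
        V = np.empty((n, k))
        V[:, 0] = p
        for i in range(s):
            V[:, i + 1] = A @ V[:, i]
        V[:, s + 1] = r
        for i in range(s - 1):
            V[:, s + 2 + i] = A @ V[:, s + 1 + i]
        G = V.T @ V                        # the one global reduction
        pc = np.zeros(k); pc[0] = 1.0      # p, r, x in V-coordinates
        rc = np.zeros(k); rc[s + 1] = 1.0
        xc = np.zeros(k)
        for _ in range(s):                 # s communication-free steps
            w = B @ pc                     # represents A @ p
            alpha = (rc @ G @ rc) / (pc @ G @ w)
            xc = xc + alpha * pc
            rc_new = rc - alpha * w
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x = x + V @ xc                     # map coordinates back
        r = V @ rc
        p = V @ pc
    return x

# quick check on a 1D Poisson matrix (s=1 reduces to classical CG)
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
print(np.linalg.norm(A @ ca_cg(A, b) - b))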

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                  96

[Convergence plot: CA-CG (monomial basis) vs CG. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; dashed line marks machine precision.]

                                  97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices

                           Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit   CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit   Graph Laplacian             Stencils
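As a one-step check of the Aᵏ claim (my expansion, in the slide's notation): squaring A = S + UDVᵀ gives

(S + UDV^T)^2 \;=\; S^2 \;+\; \underbrace{S\,UDV^T \,+\, UDV^T S \,+\, UD\,(V^T U)\,D V^T}_{\mathrm{rank}\,\le\,3\,\mathrm{rank}(D)}

so A² = S² + (low rank), and by induction Aᵏ = Sᵏ + (low rank): the matrix powers kernel can apply Sᵏ with the sparse communication-avoiding machinery and fold the low-rank part into small dense corrections.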

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                  101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

                                  103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

                                  104
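Here is a toy version of the prerounding idea (after Demmel & Nguyen / ReproBLAS); the bin width and bin count are illustrative, and the real library propagates carries so n is not limited as it is below (this sketch assumes n < 2^(54−w)). Each pass rounds every summand to a common power-of-two grid, so the partial sums of the "high" parts fit in 53 bits and are EXACT — hence independent of summation order, data layout, and number of processors.

import numpy as np

def reproducible_sum(x, bins=3, w=26):
    x = np.array(x, dtype=np.float64)      # private working copy
    n = x.size
    m = np.max(np.abs(x))                  # max is order-independent
    if m == 0.0:
        return 0.0
    # top grid 2^k chosen so that n values fit without carries
    k = int(np.ceil(np.log2(m))) + int(np.ceil(np.log2(n))) + 1
    total = 0.0
    for _ in range(bins):                  # a few bins of w bits each
        C = np.ldexp(1.0, k + 52)          # Dekker-style splitter
        high = (x + C) - C                 # round x[i] to multiple of 2^k
        x -= high                          # exact remainders
        total += np.sum(high)              # exact, order-independent sum
        k -= w                             # next, finer bin
    return total                           # bits below last bin dropped

# same bits regardless of chunking/threading of the partial sums
v = np.random.randn(10**6)
print(reproducible_sum(v) == reproducible_sum(v[::-1]))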

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                  Summary

Don't Communic…

                                  106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Handling Heterogeneity
• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^{1/2} + F_i·α_i/M_i^{3/2} = F_i·[γ_i + β_i/M_i^{1/2} + α_i/M_i^{3/2}] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n³ and minimizing T = max_i T_i
  – Answer: F_i = n³·(1/ξ_i)/Σ_j(1/ξ_j) and T = n³/Σ_j(1/ξ_j)
• Optimal Algorithm for n×n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms…
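These closed forms are cheap to evaluate; a small sketch (array names are mine):

import numpy as np

def heterogeneous_split(n, gamma, beta, alpha, M):
    """Optimal flop assignment across heterogeneous processors, per the
    formula above. gamma, beta, alpha, M are per-processor arrays of
    sec/flop, sec/word, sec/message and fast-memory sizes."""
    gamma, beta, alpha, M = map(np.asarray, (gamma, beta, alpha, M))
    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5  # effective sec/flop
    F = n**3 * (1.0 / xi) / np.sum(1.0 / xi)         # flops for proc i
    T = n**3 / np.sum(1.0 / xi)                      # equalized runtime
    return F, T

# e.g. a CPU+GPU pair: the faster, better-fed device gets most flops
F, T = heterogeneous_split(4096, gamma=[1e-9, 1e-10], beta=[1e-8, 2e-8],
                           alpha=[1e-6, 2e-6], M=[1e6, 1e5])
print(F / F.sum(), T)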

Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

C(i,j,k) = Σ_m A(i,j,m)·B(m,k)
A: 3-fold symm; B: 2-fold symm; C: 2-fold symm

Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^{2/ω}
• Extends to rectangular case: multiply (m×n)·(n×p) in q mults
  – words_moved = Ω(#flops / M^{log_{mp} q − 1})
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul: words_moved = Ω(M·(n/M^{1/2})³/P)
Strassen's O(n^{lg 7}) matmul: words_moved = Ω(M·(n/M^{1/2})^{lg 7}/P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^{1/2})^ω/P)

Communication Avoiding Parallel Strassen (CAPS)
BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  vs.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step
Best way to interleave BFS and DFS is a tuning parameter

                                    26
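A toy sequential sketch of the BFS/DFS recursion (my variable names, crude memory accounting, and the assumption that n is a power of two times the base size; a real CAPS distributes the seven Strassen subproducts across processors instead of looping over them):

import numpy as np

def caps_matmul(A, B, procs, mem):
    n = A.shape[0]
    if n <= 64:                                 # base case: classical
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h,:h], A[:h,h:], A[h:,:h], A[h:,h:]
    B11, B12, B21, B22 = B[:h,:h], B[:h,h:], B[h:,:h], B[h:,h:]
    subs = [(A11+A22, B11+B22), (A21+A22, B11), (A11, B12-B22),
            (A22, B21-B11), (A11+A12, B22), (A21-A11, B11+B12),
            (A12-A22, B21+B22)]
    if procs >= 7 and mem >= 7.0/4.0 * n*n:     # BFS step: 7 at once
        M = [caps_matmul(X, Y, procs // 7, mem / 7) for X, Y in subs]
    else:                                       # DFS step: one at a time
        M = [caps_matmul(X, Y, procs, mem / 4) for X, Y in subs]
    C = np.empty((n, n))
    C[:h,:h] = M[0] + M[3] - M[4] + M[6]
    C[:h,h:] = M[2] + M[4]
    C[h:,:h] = M[1] + M[3]
    C[h:,h:] = M[0] - M[1] + M[2] + M[5]
    return C

A = np.random.randn(256, 256); B = np.random.randn(256, 256)
assert np.allclose(caps_matmul(A, B, procs=49, mem=1e6), A @ B)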

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^{ω+η}) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms
  words_moved = O(n^{ω+η}/M^{(ω+η)/2 − 1} + n² log n), i.e. attain expected lower bound
Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms
• Motivation: minimizes communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory

CARMA Performance: Distributed Memory
[Strong-scaling plot, log-log axes: Square, m = k = n = 6144; CARMA vs ScaLAPACK vs Peak]
Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Distributed Memory
[Strong-scaling plot, log-log axes: Inner Product, m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK vs Peak]
Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance: Shared Memory
[Plot, log-linear axes: Square, m = k = n; MKL vs CARMA, single and double precision, with single/double peaks]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory
[Plot, log-linear axes: Inner Product, m = n = 64; MKL vs CARMA, single and double precision]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Bar chart: Shared Memory Inner Product (m = n = 64, k = 524288); 97% fewer misses, 86% fewer misses]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
• words_moved = O(n³)

35

• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
• words_moved = O(n³/M^{1/3})
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
• words_moved = O(n³/M^{1/2})
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm
Parallel: W = [W0; W1; W2; W3] → leaf QRs give R00, R10, R20, R30 → pairwise QRs give R01, R11 → final QR gives R02 (binary tree)
Sequential/Streaming: W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03 (chain, one block at a time)
Dual Core: a mix of the two – each core chains its own blocks, then the two chains combine
Can choose reduction tree dynamically: Multicore / Multisocket / Multirack / Multisite / Out-of-core
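A sketch of the parallel flavor, computing R on a binary reduction tree with numpy (R factors only; a real implementation keeps the implicit Q tree and runs the leaves on separate processors – assume the block count is a power of two):

import numpy as np

def tsqr_binary_tree(W, nblocks=4):
    """Each leaf QRs its local block of the tall-skinny W; each tree
    level stacks two b x b R factors (2b x b) and QRs again. One
    message per level -> O(log P) messages, vs the ~n reductions of
    column-at-a-time factorizations."""
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
    while len(Rs) > 1:                       # one tree level per pass
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]                             # R factor of all of W

# sanity check: same R as direct QR, up to row signs
W = np.random.randn(4000, 8)
R_tsqr = tsqr_binary_tree(W)
R_ref = np.linalg.qr(W, mode='r')
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))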

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree, to do "Tournament Pivoting"
W (n×b) = [W1; W2; W3; W4]; factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi, call them Wi'
Stack winners: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'
Final round: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows
Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

                                    37
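A runnable sketch of the tournament (flat binary tree; helper names are mine, not the CALU code, and each block is assumed to have at least b rows):

import numpy as np
from scipy.linalg import lu

def gepp_pivot_rows(W, b):
    """Row indices chosen by the first b steps of ordinary GEPP.
    scipy returns W = P @ L @ U, so column j of P marks pivot row j."""
    P, _, _ = lu(W)
    return np.argmax(P, axis=0)[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    """Each block nominates its b GEPP pivot rows; winners are stacked
    pairwise and re-factored; the surviving b rows become the pivots
    for ALL of W (move to top, then LU without pivoting)."""
    cand = [blk[gepp_pivot_rows(W[blk], b)]
            for blk in np.array_split(np.arange(W.shape[0]), nblocks)]
    while len(cand) > 1:
        nxt = []
        for a, c in zip(cand[0::2], cand[1::2]):
            rows = np.concatenate([a, c])        # 2b candidate rows
            nxt.append(rows[gepp_pivot_rows(W[rows], b)])
        cand = nxt
    return cand[0]

W = np.random.randn(1024, 8)
piv = tournament_pivot_rows(W, b=8)   # 8 pivot rows chosen by tournament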

Minimizing Communication in TSLU
Parallel: W = [W1; W2; W3; W4] → LU on each block → pairwise LUs → final LU (binary tree)
Sequential/Streaming: W = [W1; W2; W3; W4] → chain of LUs, one block at a time
Dual Core: a mix of the two, as for TSQR
Can choose reduction tree dynamically, to match architecture, as before

                                    38

Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or computed rows of U
  – Only tournament pivoting stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

                                    39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
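The experiment is easy to replicate; here is a rough Python equivalent (rank-3 6x6 matrices; the zero-pivot RuntimeWarnings and inf/NaN values are exactly the point):

import numpy as np
from scipy.linalg import lu

def lu_no_pivot(A):
    """Textbook LU without pivoting -- for the experiment only."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                       # may divide by ~0
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return np.tril(A, -1) + np.eye(n)

rng = np.random.default_rng(0)
diffs = []
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
    P, L, U = lu(A)                    # GEPP: A = P @ L @ U
    Lnp = lu_no_pivot(P.T @ A)         # same pivot order, no pivoting
    diffs.append(np.linalg.norm(L - Lnp))
# in exact arithmetic every entry of diffs is 0; in floating point
# expect a few 0's, a few infs/NaNs, and mostly O(1) values
print(diffs[:10])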

Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

                                    42

                                    2D CALU with Tournament Pivoting

                                    43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                    44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); up to 29x]

2.5D vs 2D LU: With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save ½ flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU
        [diagram: banded T, nonzeros hugging the diagonal]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL, Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• #Words = O(n³/M^{1/2})
• #Messages = O(n³/M)

  vs.

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
• #Words = O(n³/M^{1/2})
• #Messages = O(n³/M^{3/2})

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
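A runnable sketch of this recursion over the (min,+) semiring, assuming nonnegative edge weights with np.inf marking absent edges:

import numpy as np

def minplus(D, A, B):
    """The slide's D = A⊗B: D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j))."""
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(A):
    """Kleene's divide-and-conquer APSP; block dependencies mirror
    blocked LU, which is why the 2.5D machinery carries over."""
    D = np.array(A, dtype=float)
    n = D.shape[0]
    if n == 1:
        D[0, 0] = min(D[0, 0], 0.0)          # zero-length path i -> i
        return D
    h = n // 2
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

# tiny check: 3-node line graph 0 - 1 - 2
A = np.array([[0, 1, np.inf], [1, 0, 2], [np.inf, 2, 0]], dtype=float)
print(dc_apsp(A))   # D[0,2] should be 3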

Performance of 2.5D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
[Plot annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w×w submatrix
• Thm: words_moved = Ω(w³/M^{1/2}), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n×n grid, w = n² for 3D n×n×n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable, new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^{1/2}, i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    words_moved = Ω(min(d·n/P^{1/2}, d²·n/P))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: words_moved = Ω(d²·n/(P·M^{1/2}))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A=Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B=Bᵀ banded, of bandwidth M^{1/2}
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^{1/2} cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
[Animation over several slides: sweep 1 eliminates c columns of the band with Q1 and Q1ᵀ applied from both sides, creating a bulge of d+c diagonals; bulges 2, 3, 4, 5, 6 are chased down the band by Q2, Q2ᵀ, Q3, Q3ᵀ, Q4, Q4ᵀ, Q5, Q5ᵀ]

Conventional vs CA - SBR
Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.
[Animations comparing the two bulge-chasing schedules]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
      [ A11  A12 ]
      [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

Algorithm         | Two Levels: #Words / #Messages                                   | Memory Hierarchy: #Words / #Messages
BLAS-3            | [FLPR'99][BDLST'13][MKL etc]                                     | [FLPR'99][BDLST'13][MKL etc]
Cholesky          | [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09]          | [G'97][AP'00][BDHS'09]
Sym Indefinite    | [BBDDDPSTY'13]                                                   | [BBDDDPSTY'13]
LU                | [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13]              | [G'97][T'97][BDLST'13] / [BDLST'13]
QR                | [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13]   | [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
Rank Revealing QR | [BDD'11][DGGX'13]                                                |
Sym Eig & SVD     | [BDD'11][BDK'13]                                                 | [BDD'11]
Non Sym Eig       | [BDD'11]                                                         | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^{1/2}), #messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]

Algorithm         | #Words (BW)                                      | #Messages (L)            | Saving factor
BLAS-3            | [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]  |                          | L: n/P^{1/2}
Cholesky          | [ScaLAPACK][T'99][SD'11]                         |                          | L: n/P^{1/2}
Sym Indefinite    | [BBDDDPSTY'13][ScaLAPACK]                        | [BBDDDPSTY'13]           | L: n/P^{1/2}
LU                | [ScaLAPACK][GDX'11][T'99][SD'11]                 | [GDX'11][T'99][SD'11]    | L: n/P^{1/2}
QR                | [ScaLAPACK][DGHL'12][T'99]                       | [DGHL'12][T'99]          | L: n/P^{1/2}
Rank Revealing QR | [BDD'11][DGGX'13]                                |                          |
Sym Eig & SVD     | [BDD'11][BDK'13][ScaLAPACK]                      | [BDD'11][BDK'13]         | L: n/P^{1/2}
Non-Sym Eig       | [BDD'11]                                         | [BDD'11]                 | BW: P^{1/2}, L: n

Attaining with extra memory: 2.5D, M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation: conventional, O(k) moves of data from slow to fast memory; new, O(1) moves of data – optimal
  – Parallel implementation on p processors: conventional, O(k log p) messages (k SpMV calls, dot products); new, O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
(A minimal sketch of the matrix powers kernel idea follows.)

                                    75
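A minimal sequential sketch of the matrix powers kernel idea, for a tridiagonal (1D Laplacian) matrix partitioned by block rows. The name local_matrix_powers and the halo scheme are illustrative assumptions, not the production kernel (which handles general well-partitioned sparsity):

    import numpy as np

    def local_matrix_powers(x, k, lo, hi):
        # Owned rows are [lo, hi); fetch a halo of width k once (the only
        # "communication"), then do k local tridiagonal SpMVs. Entries far
        # enough from the halo edge stay exact at every step.
        n = x.size
        g_lo, g_hi = max(lo - k, 0), min(hi + k, n)
        xl = x[g_lo:g_hi].copy()          # one-time halo exchange
        out = []
        for _ in range(k):
            yl = 2.0 * xl                 # y_i = 2 x_i - x_{i-1} - x_{i+1}
            yl[:-1] -= xl[1:]
            yl[1:] -= xl[:-1]
            xl = yl
            out.append(xl[lo - g_lo : hi - g_lo].copy())
        return out                        # [(A x)|owned, ..., (A^k x)|owned]

    n, k, p = 32, 3, 4
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x = np.random.rand(n)
    for b in range(p):
        lo, hi = b*n//p, (b+1)*n//p
        vecs = local_matrix_powers(x, k, lo, hi)
        assert np.allclose(vecs[-1], (np.linalg.matrix_power(A, k) @ x)[lo:hi])

The redundant flops are the halo entries each "processor" recomputes; that is the price mentioned above.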

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                    77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

• 8x8 dense substructure: exploit this to limit memory references

                                    78

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heat map (Mflops) over block sizes; reference is the 1x1 kernel, best is the 4x2 block size.]

                                    79

Register Profile: Itanium 2

[Figure: performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

                                    80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile panels – Power3 (17%): 122–252 Mflops; Power4 (16%): 459–820 Mflops; Itanium 1 (8%): 107–247 Mflops; Itanium 2 (33%): 190 Mflops–1.2 Gflops. The best block size differs per machine, motivating search.]
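Since the best block size is hard to predict, implementations search. A toy exhaustive-search harness, with SciPy's BSR format standing in for a tuned register-blocked kernel; the helper name pick_block_size is hypothetical, and real autotuners estimate rather than benchmark every size:

    import itertools, time
    import numpy as np
    import scipy.sparse as sp

    def pick_block_size(A_csr, x, trials=5, rmax=8, cmax=8):
        # Convert to every r x c blocking, time SpMV, keep the fastest.
        best = (1, 1, float("inf"))
        for r, c in itertools.product(range(1, rmax + 1), range(1, cmax + 1)):
            try:
                A_bsr = sp.bsr_matrix(A_csr, blocksize=(r, c))
            except ValueError:
                continue  # blocksize must divide the matrix dimensions
            t0 = time.perf_counter()
            for _ in range(trials):
                A_bsr @ x
            t = (time.perf_counter() - t0) / trials
            if t < best[2]:
                best = (r, c, t)
        return best

    A = sp.random(1024, 1024, density=0.01, format="csr")
    x = np.random.rand(1024)
    print(pick_block_size(A, x))   # (r, c, seconds per SpMV)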

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)

• More complicated non-zero structure in general

• n = 16,614
• nnz = 1.1 M

                                    82

Zoom in to top corner

• More complicated non-zero structure in general

• n = 16,614
• nnz = 1.1 M

                                    83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

                                    84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher
(A small blocked-format sketch follows.)

                                    85
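A small sketch of register blocking with explicit fill, using SciPy's BSR format as a stand-in for a tuned 3x3 kernel (the random 900x900 matrix is just an illustration):

    import numpy as np
    import scipy.sparse as sp

    # Each 3x3 tile that touches a nonzero is stored densely, padded with
    # explicit zeros: extra flops (the "fill ratio") buy unrolled,
    # index-free tile multiplies.
    A = sp.random(900, 900, density=0.01, format="csr")
    A_bsr = sp.bsr_matrix(A, blocksize=(3, 3))
    fill_ratio = A_bsr.nnz / A.nnz   # stored values (incl. zeros) / true nnz

    x = np.random.rand(900)
    assert np.allclose(A @ x, A_bsr @ x)   # same result, blocked layout
    print("fill ratio:", fill_ratio)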

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                    86

                                    100x100 Submatrix Along Diagonal

87

                                    Post-RCM Reordering

                                    88

Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                    90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                    91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                    93

Example: Classical Conjugate Gradient (CG)

[Code figure: in each iteration, the SpMV and the dot products each require communication.]

94

Example: CA-Conjugate Gradient

[Code figure: the s SpMVs per outer iteration come from one call to the CA matrix powers kernel, and the dot products from one global reduction that computes the Gram matrix G; the local computations within the inner loop require no communication. A small demonstration of the Gram-matrix trick follows.]
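A toy NumPy demonstration of the Gram-matrix trick, with a tridiagonal A standing in for a general sparse matrix: one reduction G = VᵀV replaces s separate dot-product reductions:

    import numpy as np

    n, s = 500, 4
    A = np.diag(2*np.ones(n)) - np.diag(np.ones(n-1), 1) - np.diag(np.ones(n-1), -1)
    p = np.random.rand(n)

    V = np.empty((n, s + 1))          # Krylov basis [p, Ap, ..., A^s p]
    V[:, 0] = p
    for j in range(s):
        V[:, j+1] = A @ V[:, j]       # in CA-CG: the matrix powers kernel

    G = V.T @ V                       # ONE global reduction

    # Every inner product among basis vectors is now a local lookup:
    assert np.isclose(G[1, 0], np.dot(A @ p, p))
    assert np.isclose(G[2, 2], np.linalg.norm(A @ (A @ p))**2)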

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                    96

[Figure: convergence of CG vs. CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). Roundoff slows convergence and costs accuracy; at s = 16 the monomial basis is rank deficient and the method breaks down, stalling near machine precision.]

                                    97
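A small NumPy experiment (illustrative, not from the slides) showing why the monomial basis fails: its condition number grows geometrically with s on the same 2D Poisson model problem, which is why CA-Krylov methods switch to Newton or Chebyshev bases:

    import numpy as np

    def poisson2d(m):
        # 5-point stencil on an m x m grid via Kronecker products
        T = 2*np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
        return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

    A = poisson2d(30)                      # cond(A) ~ 400, as in the figure
    x = np.random.rand(A.shape[0])
    V = [x / np.linalg.norm(x)]
    for _ in range(16):
        v = A @ V[-1]
        V.append(v / np.linalg.norm(v))    # normalized monomial basis
    for s in (4, 8, 16):
        print(s, np.linalg.cond(np.column_stack(V[:s+1])))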

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices (a sketch of such a matvec follows)

Nonzero entries \ Indices | Explicit (O(nnz))  | Implicit (o(nnz))
Explicit (O(nnz))         | CSR and variations | Vision, climate, AMR, …
Implicit (o(nnz))         | Graph Laplacian    | Stencils
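A sketch (hypothetical sizes) of applying A = S + U·D·Vᵀ k times without ever forming the dense A, keeping the sparse and low-rank parts separate:

    import numpy as np

    n, r, k = 200, 3, 4
    rng = np.random.default_rng(0)
    # S sparse (here a tridiagonal stencil), U, V are n x r, D is r x r
    S = np.diag(2*np.ones(n)) - np.diag(np.ones(n-1), 1) - np.diag(np.ones(n-1), -1)
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))

    x = rng.standard_normal(n)
    y = x
    for _ in range(k):                   # each step: sparse part + rank-r update
        y = S @ y + U @ (D @ (V.T @ y))  # O(nnz(S) + n*r) work, no dense A
    assert np.allclose(y, np.linalg.matrix_power(S + U @ D @ V.T, k) @ x)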

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                    101

Reproducible Floating Point Computation

• Goal: get the bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two panels – absolute error for random vectors (errors of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible). Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get the exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.) – a minimal sketch follows

                                    104
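A minimal single-sweep sketch of the prerounding idea, assuming IEEE-754 doubles. The actual algorithm uses several "bins" to retain accuracy; treat this single-bin variant as illustration only:

    import math, random

    def reproducible_sum(xs):
        # Round every summand to a common coarse grid (set only by max|x|
        # and n), then sum. Each rounded value is a multiple of one grid
        # unit and the total cannot overflow the grid's precision, so the
        # result is bit-identical for ANY order or reduction tree, at the
        # cost of accuracy controlled by the grid spacing.
        n = len(xs)
        m = max(abs(x) for x in xs)
        if m == 0.0:
            return 0.0
        # S = 2^e with 2^e >= 2*n*m, so (x + S) - S preroundes x exactly
        S = 2.0 ** (math.ceil(math.log2(m)) + math.ceil(math.log2(n)) + 1)
        total = 0.0
        for x in xs:                 # any order works
            total += (x + S) - S     # x rounded to a multiple of ulp(S)
        return total

    xs = [0.1] * 1000 + [1e8, -1e8]
    for _ in range(3):
        random.shuffle(xs)
        print(reproducible_sum(xs))  # identical every time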

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                    Summary

Don't Communic…

                                    106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric.]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply (a short worked example follows)
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews
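The running contraction written with numpy.einsum; folding (i,j) and (m,n) into single axes shows it is a matmul, which is why the matmul lower bounds apply (CTF performs the distributed, symmetry-exploiting version; the sizes here are arbitrary):

    import numpy as np

    i, j, k, m, n = 8, 9, 10, 11, 12
    A = np.random.rand(i, j, m, n)
    B = np.random.rand(m, n, k)

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
    C = np.einsum("ijmn,mnk->ijk", A, B)

    # Same contraction as a matmul: fold (i,j) and (m,n) into single axes.
    C2 = (A.reshape(i*j, m*n) @ B.reshape(m*n, k)).reshape(i, j, k)
    assert np.allclose(C, C2)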

Communication Lower Bounds for Strassen-like Matmul Algorithms

Classical O(n³) matmul: words_moved = Ω(M·(n/M^{1/2})³/P)
Strassen's O(n^{lg 7}) matmul: words_moved = Ω(M·(n/M^{1/2})^{lg 7}/P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^{1/2})^ω/P)

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n²/p^{2/ω}
• Extends to the rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(#flops / M^{log_{mp} q − 1})
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step

The best way to interleave BFS and DFS is a tuning parameter. (A sequential sketch of the 7 products follows.)

                                      26
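A sequential NumPy sketch of the 7 recursive products that CAPS schedules (power-of-two sizes assumed; a BFS step would launch all 7 in parallel, a DFS step runs them one at a time):

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]                       # n a power of 2 for simplicity
        if n <= cutoff:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h,:h], A[:h,h:], A[h:,:h], A[h:,h:]
        B11, B12, B21, B22 = B[:h,:h], B[:h,h:], B[h:,:h], B[h:,h:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h,:h] = M1 + M4 - M5 + M7
        C[:h,h:] = M3 + M5
        C[h:,:h] = M2 + M4
        C[h:,h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)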

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as a Research Highlight in CACM

Strassen-like: Beyond Matmul

• Thm (D., Dumitriu, Holtz '07): any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^{ω+η}) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen itself is already stable, so η = 0
• Thm: for sequential versions of these algorithms, words_moved = O(n^{ω+η}/M^{(ω+η)/2 − 1} + n² log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors and available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors and available memory
(A recursive sketch follows.)

CARMA Performance: Distributed Memory

[Figure: strong scaling (log-log), square case m = k = n = 6144 on Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA; curves for Peak, CARMA, and ScaLAPACK.]

CARMA Performance: Distributed Memory

[Figure: strong scaling (log-log), inner-product case m = n = 192, k = 6,291,456 on Cray XE6 (Hopper); CARMA near Peak, far above ScaLAPACK.]

CARMA Performance: Shared Memory

[Figure: square case m = k = n (log-linear) on Intel Emerald (4 x 8-core Xeon X7560, 4 x NUMA); curves for single- and double-precision Peak, CARMA, and MKL.]

CARMA Performance: Shared Memory

[Figure: inner-product case m = n = 64 (log-linear) on Intel Emerald (4 x 8-core Xeon X7560, 4 x NUMA); CARMA (single, double) above MKL (single, double).]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: shared-memory inner product (m = n = 64, k = 524,288), linear scale: CARMA incurs 97% fewer misses in one configuration and 86% fewer in another.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR) so far

• Classical approach: for i = 1 to n { update column i; update trailing matrix }
  – words_moved = O(n³)

                                      35

• Blocked approach (LAPACK): for i = 1 to n/b { update block i of b columns; update trailing matrix }
  – words_moved = O(n³/M^{1/3})

• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – words_moved = O(n³/M^{1/2})

• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3] (tall-skinny, one row block per processor)

• Parallel (binary tree): factor each block, Wi → Ri0; combine pairs, [R00; R10] → R01 and [R20; R30] → R11; then [R01; R11] → R02
• Sequential/streaming (flat tree): W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03
• Dual core: a hybrid of the two trees

Can choose the reduction tree dynamically to match the machine: multicore, multisocket, multirack, multisite, out-of-core. (A small binary-tree sketch follows.)
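A binary-tree TSQR sketch in NumPy, keeping only the R factors (a full implementation would store the Q factors implicitly at each tree node):

    import numpy as np

    def tsqr_R(W, p=4):
        # Leaves: local QR of each row block; only small R factors move.
        Rs = [np.linalg.qr(Wi, mode="r") for Wi in np.array_split(W, p)]
        while len(Rs) > 1:                   # combine pairs up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode="r")
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(1000, 8)
    R = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode="r")
    # R agrees with a direct QR up to the sign of each row
    assert np.allclose(np.abs(R), np.abs(R_ref))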

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

• Factor each block, Wi = Pi·Li·Ui; choose the b pivot rows of each Wi, call them Wi'
• Combine pairs: [W1'; W2'] = P12·L12·U12, choose b pivot rows W12'; likewise [W3'; W4'] → W34'
• Combine again: [W12'; W34'] = P1234·L1234·U1234, choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting). (A sketch of the tournament follows.)

                                      37
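A sketch of tournament pivoting with SciPy's GEPP (lu_factor) as the local factorization; pivot_rows and tournament_pivot are hypothetical names:

    import numpy as np
    import scipy.linalg as sla

    def pivot_rows(block_rows, A_block, b):
        # GEPP on this block; return the global ids of its first b pivot rows
        lu, piv = sla.lu_factor(A_block)
        order = np.array(block_rows)
        for i, pi in enumerate(piv):
            order[[i, pi]] = order[[pi, i]]   # replay LAPACK row swaps
        return order[:b]

    def tournament_pivot(W, b, leaves=4):
        groups = np.array_split(np.arange(W.shape[0]), leaves)
        cands = [pivot_rows(g, W[g], b) for g in groups]       # local GEPP
        while len(cands) > 1:                                  # binary tree
            nxt = []
            for i in range(0, len(cands), 2):
                rows = np.concatenate(cands[i:i+2])
                nxt.append(pivot_rows(rows, W[rows], b))       # 2b-row playoff
            cands = nxt
        return cands[0]    # b pivot rows to move to the top of W

    W = np.random.rand(64, 4)
    print(tournament_pivot(W, b=4))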

Minimizing Communication in TSLU

W = [W1; W2; W3; W4]

• Parallel: a binary tree of small LU factorizations, combined pairwise
• Sequential/streaming: a flat tree of LU factorizations
• Dual core: a hybrid tree

Can choose the reduction tree dynamically to match the architecture, as before.

                                      38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as partial pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA − LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• The proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare the L factors: are they the same?
    • Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: ||L − Lnp|| often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare):
  – Test the conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

                                      42

                                      2D CALU with Tournament Pivoting

                                      43

2.5D CALU with Tournament Pivoting (c = 4 copies)

                                      44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale Predicted Speedups for Gaussian Elimination: 2D CA-LU vs. ScaLAPACK-LU

[Figure: heat map over log2(p) and log2(n²/p) = log2(memory_per_proc); speedups up to 29x.]

2.5D vs. 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down the column and along the row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·Pᵀ = L·T·Lᵀ, where T is banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

                                      49

• Standard recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – Words = O(n³/M^{1/2}); Messages = O(n³/M)

• Shape-morphing LU (SMLU):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  – Words = O(n³/M^{1/2}); Messages = O(n³/M^{3/2})

Other CA algorithms for Ax=b, least squares (3/3)

• The need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's algorithm (a runnable sketch follows):

    D = DC-APSP(A, n):
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
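A runnable NumPy sketch of this recursion (dense and sequential; minplus and dc_apsp are hypothetical names standing in for the distributed 2.5D kernels):

    import numpy as np

    def minplus(D, A, B):
        # the slide's "D = A (x) B": D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0.0)
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    INF = np.inf
    A = np.array([[0, 3, INF, 7], [8, 0, 2, INF],
                  [5, INF, 0, 1], [2, INF, INF, 0]], dtype=float)
    D_fw = A.copy()                       # check against Floyd-Warshall
    for k in range(4):
        D_fw = np.minimum(D_fw, D_fw[:, k:k+1] + D_fw[k:k+1, :])
    assert np.allclose(dc_apsp(A), D_fw)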

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups of 6.2x and 2x.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: words_moved = Ω(w³/M^{1/2}), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^{1/2}, i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    words_moved = Ω(min(d·n/P^{1/2}, d²·n/P))
  – The proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: words_moved = Ω(d²·n/(P·M^{1/2}))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar):
  – A → QᵀAQ = T, Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd

• Communication-avoiding approach:
  – A → QAQᵀ = B, where B = Bᵀ is banded with bandwidth ≈ M^{1/2}
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^{1/2} cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence, steps 1–6: annihilating c columns of the band (width b+1) with an orthogonal transform Q1 creates a (d+c)-wide bulge below the band; applying Q1ᵀ, then Q2, Q2ᵀ, Q3, Q3ᵀ, Q4, Q4ᵀ, Q5, Q5ᵀ, … chases each bulge down and off the matrix, leaving bandwidth d+1.]

Conventional vs CA - SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QTAQ will be block upper triangular:  [ A11 A12 ; ε A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
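As a concrete illustration, here is a minimal numpy/scipy sketch of one split step. It uses the classical Newton iteration for the matrix sign function rather than the randomized implicit-repeated-squaring of [BDD'11], assumes no eigenvalues on the imaginary axis, and the function name and iteration count are illustrative only:

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, iters=50):
    # Newton iteration X <- (X + inv(X))/2 converges to sign(A)
    # (assumes no eigenvalues of A on the imaginary axis).
    X = A.copy()
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))
    # Rank-revealing (column-pivoted) QR of I + sign(A): the leading
    # columns of Q span the invariant subspace for Re(lambda) > 0.
    Q, R, piv = qr(np.eye(A.shape[0]) + X, pivoting=True)
    B = Q.T @ A @ Q     # block upper triangular up to roundoff:
    return Q, B         # B ~ [[A11, A12], [eps, A22]]
```

The randomized scheme on the slide plays the same role but picks the splitting location randomly and avoids explicit inverses.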

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words | Messages, for Two Levels and for Full Memory Hierarchy (merged cells shown once)

BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Ω(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

Columns: Words (BW) | Messages (L) | Saving factor
BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Ω(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data – optimal
  – Parallel implementation on p processors: conventional O(k log p) messages (k SpMV calls, dot products); new O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                      75
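As a sketch of what the conventional variant does, the loop below builds the Krylov basis with k separate SpMVs, each a pass over A (and, in parallel, a round of neighbor communication). The CA "matrix powers kernel" produces the same vectors with O(1) reads of A by replicating ghost zones; the function name here is illustrative, not a library call:

```python
import numpy as np

def krylov_basis(A, x, k):
    # Conventional approach: [x, Ax, A^2 x, ..., A^k x] via k SpMVs,
    # so A moves from slow to fast memory k times. The CA matrix
    # powers kernel computes the same basis in O(1) passes over A,
    # at the price of some redundant "ghost zone" computation.
    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])   # SpMV: one pass / one communication round
    return np.column_stack(V)
```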

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78
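Register blocking exploits exactly this: store one index per r×c dense block instead of one per nonzero, cutting memory references. A minimal Block CSR (BCSR) SpMV sketch — the layout and names are assumptions for illustration, not a particular library's format:

```python
import numpy as np

def bcsr_spmv(val, col_idx, row_ptr, x, r, c):
    # y = A @ x with A in r-by-c Block CSR (BCSR):
    #   val[k]     : k-th dense r x c block
    #   col_idx[k] : its block-column index
    #   row_ptr[i] : first block of block-row i
    # A tuned version unrolls the r x c block multiply; the point is
    # one column index per block, not per entry.
    nbrows = len(row_ptr) - 1
    y = np.zeros(nbrows * r)
    for i in range(nbrows):
        yi = np.zeros(r)
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            yi += val[k] @ x[j * c:(j + 1) * c]
        y[i * r:(i + 1) * r] = yi
    return y
```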

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile — Reference implementation vs best block size (4x2), performance in Mflop/s.]

79

Register Profile: Itanium 2

[Figure: performance of all register block sizes, from 190 Mflop/s (reference) to 1190 Mflop/s (best).]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile heat maps — Power3 (best 252 Mflop/s, reference 122 Mflop/s; 17% of peak), Power4 (best 820 Mflop/s, reference 459 Mflop/s; 16%), Itanium 1 (best 247 Mflop/s, reference 107 Mflop/s; 8%), Itanium 2 (best 1.2 Gflop/s, reference 190 Mflop/s; 33%).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher, since 1.5x the flops run in 1/1.5 the time

85

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                      86

                                      100x100 Submatrix Along Diagonal

87

                                      Post-RCM Reordering

                                      88

Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·ATx, ATA·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                      90
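The ATA·x speedup, for instance, comes from using each sparse row twice while it sits in fast memory, rather than making two separate SpMV passes over A. A hedged sketch, with the row storage format assumed for illustration:

```python
import numpy as np

def ata_times_x(rows, x, n):
    # y = (A^T A) x computed as y = sum_i A(i,:)^T * (A(i,:) . x):
    # each sparse row is read once and used twice while in cache.
    # rows: list of (col_indices, values) pairs, one per row of A.
    y = np.zeros(n)
    for idx, vals in rows:
        t = np.dot(vals, x[idx])   # t = A(i,:) . x
        y[idx] += t * vals         # y += t * A(i,:)^T
    return y
```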

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & ATy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                      91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                      93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

94

Example: CA-Conjugate Gradient

The k SpMVs are replaced by one call to the CA matrix powers kernel, and the dot products by one global reduction that computes the Gram matrix G. Local computations within the inner loop require no communication.
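To make the contrast concrete, here is a minimal dense-storage sketch of classical CG with its communication points marked; it is illustrative only (no preconditioning, and `A @ p` stands in for the parallel SpMV). CA-CG restructures this loop so that k iterations' worth of SpMVs come from the matrix powers kernel and the 2k dot products collapse into one block reduction:

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    # Classical CG: each iteration does 1 SpMV and 2 dot products,
    # i.e. (in parallel) neighbor communication + 2 global reductions.
    x = x0.copy()
    r = b - A @ x                 # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                    # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                # SpMV: neighbor communication
        alpha = rr / (p @ Ap)     # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r            # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```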

                                      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                      bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                      96

[Figure: convergence of CA-CG (monomial basis) vs. CG on a model problem — 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. A horizontal line marks machine precision.]

                                      97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVT, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Ak = (S + UDVT)k as sum of Sk and low-rank matrices

                     Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit:    CSR and variations          Vision, climate, AMR, …
Entries implicit:    Graph Laplacian             Stencils
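For the sparse-plus-low-rank case, Ax can be applied without ever forming the dense sum; a small sketch under the slide's notation A = S + UDVT (names are the slide's, the function is illustrative):

```python
import numpy as np

def apply_sparse_plus_lowrank(S, U, D, V, x):
    # y = (S + U D V^T) x: one SpMV with S plus three skinny
    # dense products, never materializing the dense matrix A.
    return S @ x + U @ (D @ (V.T @ x))
```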

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                      101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

104
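A one-bin sketch of the prerounding idea, simplified from the Nguyen/Demmel scheme (which keeps several bins to preserve accuracy); handling of all-zero input and overflow is omitted, and the function name is illustrative:

```python
import numpy as np

def reproducible_sum(x):
    # Round every summand to a multiple of one common ulp, chosen
    # from max|x_i| and n, so that all subsequent additions are
    # exact -- hence the result is independent of summation order.
    n = x.size
    m = np.max(np.abs(x))      # a max-reduction is itself reproducible
    shift = 1.5 * 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 2)
    t = (x + shift) - shift    # preround each x_i to a multiple of ulp(shift)
    s = shift
    for ti in np.random.permutation(t):   # any order: every add is exact
        s += ti
    return s - shift           # identical bits regardless of order
```

The low-order bits discarded by the prerounding are what the multi-bin version recovers; this sketch trades accuracy for a short demonstration of the order-independence.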

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


C(i,j,k) = Σm A(i,j,m)·B(m,k)

A: 3-fold symm. B: 2-fold symm. C: 2-fold symm.

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
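To see why the matmul lower bounds apply, note that C(i,j,k) = Σm A(i,j,m)·B(m,k) is a matrix multiply with (i,j) fused into one index; a small numpy check (sizes are arbitrary):

```python
import numpy as np

# Hypothetical sizes for the contraction C(i,j,k) = sum_m A(i,j,m) B(m,k)
i, j, m, k = 8, 8, 16, 12
A = np.random.rand(i, j, m)
B = np.random.rand(m, k)

# The contraction is a matmul in disguise: flatten (i,j) into one index,
# multiply, reshape back -- so matmul communication lower bounds apply.
C = np.einsum('ijm,mk->ijk', A, B)
C_via_matmul = (A.reshape(i * j, m) @ B).reshape(i, j, k)
assert np.allclose(C, C_via_matmul)
```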

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/p^(2/ω)
• Extends to rectangular case: multiply (m×n)·(n×p) in q mults
  – words_moved = Ω(#flops / M^(log_mp q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul: words_moved = Ω(M·(n/M^(1/2))³/P)
Strassen's O(n^lg7) matmul: words_moved = Ω(M·(n/M^(1/2))^lg7/P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^(1/2))^ω/P)

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Best way to interleave BFS and DFS is a tuning parameter.

26

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM
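For reference, one level of Strassen's recursion below shows the seven products M1–M7 that CAPS schedules either all in parallel (BFS) or one after another (DFS); this sequential sketch assumes even n and is not the CAPS code itself:

```python
import numpy as np

def strassen_step(A, B):
    # One level of Strassen: 7 half-size multiplies instead of 8.
    # CAPS runs these 7 products all at once on P/7 processors each
    # (BFS step) or one at a time on all P processors (DFS step).
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```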

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n² log n), i.e. attain expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory

CARMA Performance: Distributed Memory

[Figure: strong scaling (log-log), square case m = k = n = 6144 — CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper), each node 2 × 12-core, 4 × NUMA.]

CARMA Performance: Distributed Memory

[Figure: strong scaling (log-log), inner-product-shaped case m = n = 192, k = 6291456 — CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper), each node 2 × 12-core, 4 × NUMA.]

CARMA Performance: Shared Memory

[Figure: square case m = k = n (log x-axis, linear y-axis) — MKL and CARMA in single and double precision, with single/double peak lines. Intel Emerald: 4 × Intel Xeon X7560 (8 cores each), 4 × NUMA.]

CARMA Performance: Shared Memory

[Figure: inner-product-shaped case m = n = 64 (log x-axis, linear y-axis) — MKL and CARMA in single and double precision. Intel Emerald: 4 × Intel Xeon X7560 (8 cores each), 4 × NUMA.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: shared-memory inner product (m = n = 64, k = 524288), linear scale — CARMA incurs 97% fewer and 86% fewer L3 misses than MKL.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n³)

35

• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n³/M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

Parallel: W = [W0; W1; W2; W3] → local QRs give (R00, R10, R20, R30) → pairwise combines give (R01, R11) → final combine gives R02
Sequential/Streaming: W = [W0; W1; W2; W3] → R00 → fold in next block → R01 → R02 → R03
Dual Core: a hybrid tree, mixing local QRs per core with pairwise combines across cores

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
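A compact sketch of the parallel-tree variant, keeping only the R factors (Q stays implicit, as on the slide); the block splitting and tree shape here are illustrative:

```python
import numpy as np

def tsqr_r(W_blocks):
    # Tall-skinny QR with a binary reduction tree: local QR on each
    # row block, then repeatedly stack pairs of R factors and re-QR.
    Rs = [np.linalg.qr(W, mode='r') for W in W_blocks]   # local QRs
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]             # combine pairs
    return Rs[0]

# e.g. R = tsqr_r(np.array_split(np.random.rand(4096, 8), 4))
# agrees with np.linalg.qr(W)[1] up to the signs of the rows
```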

Back to LU: Using similar idea for TSLU as TSQR — use reduction tree to do "Tournament Pivoting"

W (n×b) = [W1; W2; W3; W4]
Factor each block: Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi'
Stack pairs and factor: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'
Stack and factor: [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows
Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

37

Minimizing Communication in TSLU

Parallel: W = [W1; W2; W3; W4] → local LU on each block → pairwise LU of stacked pivot rows → final LU
Sequential/Streaming: W = [W1; W2; W3; W4] → LU of W1, then fold in W2, W3, W4 one block at a time
Dual Core: a hybrid of the two trees

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: get same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
  • Test conditioning of U; if not tiny (usual case), proceed, else
  • Compute || L ||; if not big (usual case), proceed, else
  • Factor A = QR using TSQR, then
  • Factor Q = PLU using TSLU, then
  • A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

                                        2D CALU with Tournament Pivoting

                                        43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                        44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map over log2(p) (x-axis) and log2(n²/p) = log2(memory_per_proc) (y-axis); speedups up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: PAPT = LDLT, D "simple"
    • Save ½ the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAPT = LTLT where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  #words = O(n³/M^(1/2))
  #messages = O(n³/M)

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
  #words = O(n³/M^(1/2))
  #messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B (min-plus product)
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

52

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
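The only change from classical matmul is the semiring: a sketch of the min-plus product, which has the same dependency structure as C = A·B, so the 2.5D machinery carries over with (min, +) replacing (+, ×):

```python
import numpy as np

def minplus(A, B):
    # Min-plus "matmul": C(i,j) = min_k ( A(i,k) + B(k,j) ).
    n = A.shape[0]
    C = np.full((n, n), np.inf)
    for k in range(n):
        # rank-1 "update" in the (min,+) semiring
        C = np.minimum(C, A[:, [k]] + B[[k], :])
    return C
```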

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6, 1024 nodes = 24576 cores); annotations: 62x speedup, 2x speedup.]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w×w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n×n grid; w = n² for 3D n×n×n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                        Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Columns of Q·U are the eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal (a SciPy sketch of this pipeline follows below)
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
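For reference, a minimal SciPy sketch of the conventional Dense → Tridiagonal → Diagonal pipeline (the banded intermediate of the CA approach is not shown; this is an illustration, not LAPACK's actual sytrd path):

import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(0)
n = 300
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric test matrix

# Dense -> Tridiagonal: the Hessenberg form of a symmetric matrix is tridiagonal
T, Q = hessenberg(A, calc_q=True)      # A = Q T Q^T
d = np.diag(T).copy()                  # main diagonal
e = np.diag(T, -1).copy()              # subdiagonal

# Tridiagonal -> Diagonal
evals = eigh_tridiagonal(d, e, eigvals_only=True)
assert np.allclose(np.sort(evals), np.sort(np.linalg.eigvalsh(A)))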

Successive Band Reduction (Bischof/Lang/Sun)

[Animation omitted, one frame per slide: a symmetric band matrix of bandwidth b. Each sweep Qi (applied as Qi^T·A·Qi, for Q1 through Q5) annihilates a parallelogram of c columns and d diagonals, creating a bulge of size about (d+c) x (d+c) that later sweeps chase down the band (stages 1 through 6 shown). Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

Conventional vs CA-SBR

Conventional: touches all data 4 times
Communication-Avoiding: touches all data once

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 1.7x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 1.2x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 2.5x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 3.0x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
      [ A11  A12 ]
      [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages); Memory Hierarchy (Words, Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] / [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] / [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13] / [G'97][T'97][BDLST'13] / [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] / [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
• Non-Sym. Eig: [BDD'11] / [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW); Messages (L); saving factor from extra memory

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] – L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] – L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] / [BBDDDPSTY'13] – L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] / [GDX'11][T'99][SD'11] – L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] / [DGHL'12][T'99] – L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] / [BDD'11][BDK'13] – L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] / [BDD'11] – BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                        Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                        75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                        77

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the BSR sketch below)

                                        78
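SciPy's BSR format implements exactly this register-blocking idea; a small sketch (the 8x8 random block layout is an assumption for illustration, not raefsky's actual structure):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
nb, b = 256, 8                                  # 256x256 grid of 8x8 blocks
mask = rng.random((nb, nb)) < 0.02              # which blocks are nonzero
A = sp.lil_matrix((nb * b, nb * b))
for i, j in zip(*np.nonzero(mask)):
    A[i*b:(i+1)*b, j*b:(j+1)*b] = rng.standard_normal((b, b))

A_csr = A.tocsr()
A_bsr = A_csr.tobsr(blocksize=(b, b))           # one column index per block

x = rng.standard_normal(nb * b)
assert np.allclose(A_csr @ x, A_bsr @ x)
print(A_csr.indices.size / A_bsr.indices.size)  # ~b*b fewer indices to fetch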

Speedups on Itanium 2: The Need for Search

[Figure omitted: register-blocking profiles comparing the reference implementation against the best block size found by search (4x2), both reported in Mflops]

                                        79

Register Profile: Itanium 2

[Heat map omitted: performance over register block sizes, ranging from 190 Mflops to 1190 Mflops]

                                        80

Register Profiles: IBM and Intel IA-64

[Heat maps omitted: Power3 (122 to 252 Mflops), Power4 (459 to 820 Mflops), Itanium 1 (107 to 247 Mflops), Itanium 2 (190 Mflops to 1.2 Gflops); the best block size varies widely across machines]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                                        82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                                        83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                        84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5² = 2.25x higher

                                        85

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                        86

                                        100x100 Submatrix Along Diagonal

87

                                        Post-RCM Reordering

                                        88

                                        Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                        90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV, Ax & A^T·y, TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                        91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                        93

Example: Classical Conjugate Gradient (CG)

[Pseudocode omitted: the SpMV and the dot products require communication in each iteration; a Python rendering follows below]
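A plain-Python rendering of the classical iteration (my own code, not the slide's exact pseudocode), with the per-iteration communication points marked as they would appear in a parallel run:

import numpy as np

def cg(A, b, tol=1e-10, maxiter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r                     # dot product -> global reduction
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV -> neighbor communication
        alpha = rr / (p @ Ap)      # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r             # dot product -> global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 100
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson, SPD
b = np.ones(n)
x = cg(A, b)
assert np.allclose(A @ x, b)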

                                        94

Example: CA-Conjugate Gradient

[Pseudocode omitted: s SpMV steps are done via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                        96

[Convergence plot omitted: CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG converges more slowly due to roundoff and loses accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down (reproduced in the sketch below)]
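The breakdown is easy to reproduce: the condition number of the monomial Krylov basis [p, Ap, A²p, …] grows exponentially in s. A small sketch (1D Poisson stands in for the slide's 2D model problem):

import numpy as np

n = 64
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson
rng = np.random.default_rng(0)
v = rng.standard_normal(n)

for s in (4, 8, 12, 16):
    K = [v]
    for _ in range(s):
        K.append(A @ K[-1])                 # monomial basis: v, Av, A^2 v, ...
    cond = np.linalg.cond(np.column_stack(K))
    print(s, f"{cond:.2e}")                 # blows up toward 1/eps as s grows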

                                        97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices (a sketch of applying such an A follows below)

                             Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:    CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit:    Graph Laplacian             Stencils
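For such a sum, A can be applied to a vector without ever forming the dense matrix; a minimal sketch (names and sizes are illustrative):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n, r = 2000, 5
S = sp.random(n, n, density=1e-3, format="csr", random_state=1)
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))

def apply_A(x):
    # y = (S + U D V^T) x: O(nnz(S) + n*r) work, no dense n x n matrix
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
A_dense = S.toarray() + U @ D @ V.T
assert np.allclose(apply_A(x), A_dense @ x)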

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                        101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

                                        Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Plots omitted: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6; data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads. Absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible]

                                        103

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches (a small demo follows below):
  – Guarantee a fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

                                        GoalsApproaches for Reproducibility

                                        104
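The underlying problem, and one expensive-but-exact fix, in a few lines of Python. math.fsum returns the correctly rounded sum, so it is order independent; it is a stand-in for the high-precision approach, not the prerounding technique itself:

import math, random

random.seed(0)
xs = [random.uniform(-1, 1) * 10**random.randint(0, 12) for _ in range(100000)]
ys = list(reversed(xs))                  # same summands, different order

print(sum(xs) == sum(ys))                # typically False: fp + is nonassociative
print(math.fsum(xs) == math.fsum(ys))    # True: exactly rounded, reproducible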

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                        Summary

Don't Communic…

                                        106

Time to redesign all linear algebra, n-body, … algorithms and software

                                        (and compilers)


                                          Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply (an einsum sketch follows below)
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
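Such a contraction is just a matmul on "unfolded" tensors, which is one way to see why the matmul lower bounds carry over; a NumPy sketch:

import numpy as np

I, J, K, M, N = 3, 4, 5, 6, 7
A = np.random.rand(I, J, M, N)
B = np.random.rand(M, N, K)

C = np.einsum('ijmn,mnk->ijk', A, B)   # C(i,j,k) = sum_mn A(i,j,m,n) B(m,n,k)

# the same contraction as an (I*J x M*N) times (M*N x K) matrix multiply
C2 = (A.reshape(I*J, M*N) @ B.reshape(M*N, K)).reshape(I, J, K)
assert np.allclose(C, C2)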

                                          Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n²/p^(2/ω)
• Extends to the rectangular case: multiply (m x n)·(n x p) in q mults
  – #words_moved = Ω(#flops / M^(log_mp q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul: #words_moved = Ω(M·(n/M^(1/2))³/P)
Strassen's O(n^lg 7) matmul: #words_moved = Ω(M·(n/M^(1/2))^lg 7/P)
Strassen-like O(n^ω) matmul: #words_moved = Ω(M·(n/M^(1/2))^ω/P)

                                          vs

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step, end if

Communication-Avoiding Parallel Strassen (CAPS)

Best way to interleave BFS and DFS is a tuning parameter (a sequential sketch of the recursion follows below)

                                          26
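A sequential sketch of the recursion CAPS is built on: Strassen's seven products, with a base-case cutoff standing in for the BFS/DFS decision (n is assumed a power of two times the threshold; this is not CAPS itself, which distributes the subproblems):

import numpy as np

def strassen(A, B, threshold=64):
    n = A.shape[0]
    if n <= threshold:
        return A @ B                      # classical kernel at the base
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # the 7 subproblems: CAPS runs them all at once on P/7 processors each
    # (a BFS step) or one after another on all P processors (a DFS step)
    M1 = strassen(A11 + A22, B11 + B22, threshold)
    M2 = strassen(A21 + A22, B11, threshold)
    M3 = strassen(A11, B12 - B22, threshold)
    M4 = strassen(A22, B21 - B11, threshold)
    M5 = strassen(A11 + A12, B22, threshold)
    M6 = strassen(A21 - A11, B11 + B12, threshold)
    M7 = strassen(A12 - A22, B21 + B22, threshold)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)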

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

                                          Invited to appear as Research Highlight in CACM

                                          Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 is needed to deal with numerical stability
  – Strassen is already stable, so η = 0
• Thm: For sequential versions of these algorithms: #Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n² log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

                                          Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors and available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors and available memory

CARMA Performance: Distributed Memory
Square: m = k = n = 6144

[Log-log strong-scaling plot omitted: CARMA vs ScaLAPACK, with peak shown]

Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Distributed Memory
Inner Product: m = n = 192, k = 6,291,456

[Log-log strong-scaling plot omitted: CARMA vs ScaLAPACK, with peak shown]

Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Shared Memory
Square: m = k = n

[Plot omitted (log x-axis, linear y-axis): MKL vs CARMA in single and double precision, with single- and double-precision peaks shown]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory
Inner Product: m = n = 64

[Plot omitted (log x-axis, linear y-axis): MKL vs CARMA in single and double precision]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524288)

[Bar chart omitted: CARMA incurs 97% and 86% fewer L3 cache misses than MKL on this problem]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
    for i = 1 to n: update column i, update trailing matrix
  – #words_moved = O(n³)
• Blocked approach (LAPACK):
    for i = 1 to n/b: update block i of b columns, update trailing matrix
  – #words_moved = O(n³/M^(1/3))
• Recursive approach (a runnable sketch follows below):
    func factor(A):
      if A has 1 column, update it
      else: factor(left half of A); update right half of A; factor(right half of A)
  – #words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting => n reductions
• Need another idea
35
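A runnable rendering of the recursive approach (a sketch only: no pivoting, so the test matrix is made diagonally dominant; tournament pivoting, below, supplies what is missing):

import numpy as np

def rlu(A):
    # recursive LU; L and U are packed into one array, L has unit diagonal
    A = A.copy()
    n = A.shape[1]
    if n == 1:
        A[1:] /= A[0, 0]
        return A
    h = n // 2
    A[:, :h] = rlu(A[:, :h])                       # factor left half
    L11 = np.tril(A[:h, :h], -1) + np.eye(h)
    A[:h, h:] = np.linalg.solve(L11, A[:h, h:])    # update right half: U12
    A[h:, h:] -= A[h:, :h] @ A[:h, h:]             # Schur complement
    A[h:, h:] = rlu(A[h:, h:])                     # factor right half
    return A

n = 256
A = np.random.rand(n, n) + n * np.eye(n)           # diagonally dominant
F = rlu(A)
L = np.tril(F, -1) + np.eye(n)
U = np.triu(F)
assert np.allclose(L @ U, A)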

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3] (block rows)
• Parallel (binary tree): local QRs give R00, R10, R20, R30; pairwise combines give R01, R11; a final combine gives R02
• Sequential/streaming (flat tree): R00 from W0 is folded into W1 to give R01, then W2 gives R02, then W3 gives R03
• Dual core: a hybrid of the two trees

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core (a NumPy sketch follows below)
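A two-level (flat-tree) TSQR sketch in NumPy: local QRs on the block rows play the leaf factorizations, and one QR of the stacked R factors is the reduction; a binary tree applies the same combine pairwise. Assumes each block has at least as many rows as W has columns:

import numpy as np

def tsqr(W, p=4):
    blocks = np.array_split(W, p)                  # one block row per "processor"
    Qs, Rs = zip(*[np.linalg.qr(Wi) for Wi in blocks])   # local QRs, no communication
    Q2, R = np.linalg.qr(np.vstack(Rs))            # one reduction combines the R's
    b = W.shape[1]
    Q = np.vstack([Qi @ Q2[i*b:(i+1)*b] for i, Qi in enumerate(Qs)])
    return Q, R

W = np.random.rand(4000, 50)
Q, R = tsqr(W)
assert np.allclose(Q @ R, W)
assert np.allclose(Q.T @ Q, np.eye(50))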

Back to LU: Using a similar idea for TSLU as for TSQR: use a reduction tree to do "tournament pivoting"

W (n x b) = [W1; W2; W3; W4], with Wi = Pi·Li·Ui
  Choose b pivot rows of each Wi; call them Wi'
[W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34
  Choose b pivot rows of each; call them W12' and W34'
[W12'; W34'] = P1234·L1234·U1234
  Choose b pivot rows
Go back to W and use these b pivot rows (move them to the top, do LU without pivoting); a NumPy sketch of the tournament follows below

37
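A NumPy sketch of the tournament (the flat leaf split and helper names are illustrative): each node of the tree runs b steps of GEPP on its candidate rows and passes the b winners up:

import numpy as np

def gepp_pivot_rows(Wblk, b):
    # b steps of LU with partial pivoting; return local indices of the pivot rows
    W = Wblk.astype(float).copy()
    idx = np.arange(W.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(W[k:, k]))
        W[[k, p]], idx[[k, p]] = W[[p, k]], idx[[p, k]]
        W[k+1:, k] /= W[k, k]
        W[k+1:, k+1:] -= np.outer(W[k+1:, k], W[k, k+1:])
    return idx[:b]

def tournament_pivots(W, b, leaves=8):
    groups = np.array_split(np.arange(W.shape[0]), leaves)
    groups = [g[gepp_pivot_rows(W[g], b)] for g in groups]   # local selection
    while len(groups) > 1:                                   # binary reduction tree
        nxt = []
        for g1, g2 in zip(groups[::2], groups[1::2]):
            g = np.concatenate((g1, g2))
            nxt.append(g[gepp_pivot_rows(W[g], b)])
        groups = nxt
    return groups[0]   # rows to move to the top; then do LU without pivoting

W = np.random.randn(1024, 8)
print(sorted(tournament_pivots(W, 8)))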

Minimizing Communication in TSLU

[Diagram omitted: the same reduction trees as in TSQR, with a local LU at each node: parallel (binary tree), sequential/streaming (flat tree), and dual core (hybrid)]

Can choose the reduction tree dynamically to match the architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": The new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

40

Why is stability of TSLU just a "Thm"?

• Proof is correct - in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L - Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

41

                                          Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
  – Test the conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

                                          2D CALU with Tournament Pivoting

                                          43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                          44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 (≈ 1,000,000) nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot omitted: axes log2(p) and log2(n²/p) = log2(memory_per_proc); predicted speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU

48

(figure: band structure of T)

      – Solve/factor narrow band problem with T
• Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• Recursive LU in column layout:

    func factor(A)
      if A has 1 column
        update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)

  – #Words = O(n³/M^(1/2))
  – #Messages = O(n³/M)

• Shape Morphing LU (SMLU):

    func factor(A)
      if A has 1 column
        update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

  – #Words = O(n³/M^(1/2))
  – #Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring (see the sketch below)
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
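For concreteness, here is the semiring product D = A⊗B in numpy (an illustrative sketch only: it materializes an n³ intermediate; the point is that the data flow matches classical matmul, so the 2.5D schedule carries over):

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)):
        # matmul with (+, *) replaced by (min, +)
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))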

Performance of 2.5D APSP using Kleene

53

(figure: Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup)

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

(sequence of figures: sweeps 1-6 of bulge chasing on a band of width b+1, applying orthogonal transforms Q1, Q1^T, …, Q5, Q5^T; b = bandwidth, c = #columns, d = #diagonals, constraint: c+d ≤ b)

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        Q^T·A·Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
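A structural sketch of the recursion; invariant_subspace is a hypothetical helper standing in for the randomized RRQR-based splitting, returning an orthogonal Q and split point r:

    import numpy as np

    def spectral_dc(A):
        # eigenvalues of A by recursive block triangularization
        n = A.shape[0]
        if n <= 2:
            return np.linalg.eigvals(A)
        Q, r = invariant_subspace(A)   # hypothetical randomized helper
        B = Q.T @ A @ Q                # block upper triangular, up to eps
        return np.concatenate((spectral_dc(B[:r, :r]),     # A11
                               spectral_dc(B[r:, r:])))    # A22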

Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

Columns: Two Levels (#Words | #Messages) and Memory Hierarchy (#Words | #Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)); attaining with extra memory: 2.5D, M = Θ(c·n²/P))
Legend: [Existing], [Ours], [Math-Lib], [Random]

Columns: #Words (BW) | #Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
(a sketch of the k-step "matrix powers" computation follows below)

                                          75
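The kernel underneath these methods computes the Krylov basis [x, Ax, …, A^k x]; the naive loop below shows what a CA implementation must reproduce while reading A only once (serial) or with one round of ghost-zone exchange (parallel). A sketch:

    import numpy as np

    def matrix_powers(A, x, k):
        # conventional version: one SpMV, and its data movement, per step
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])
        return np.column_stack(V)   # n x (k+1) basis for CA-Krylov methods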

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

(figure: register-blocking profile; reference implementation: 190 Mflops; best block size, 4x2: 1190 Mflops)

79

Register Profile: Itanium 2

(figure: performance ranges from 190 Mflops to 1190 Mflops across block sizes)

80

Register Profiles: IBM and Intel IA-64

(figures: register-profile heat maps; Power3 – 17%, 122 to 252 Mflops; Power4 – 16%, 459 to 820 Mflops; Itanium 1 – 8%, 107 to 247 Mflops; Itanium 2 – 33%, 190 Mflops to 1.2 Gflops)

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
(a fill-ratio measurement sketch follows below)

                                          85
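The fill-ratio trade-off is easy to measure with scipy's BSR format (a sketch on a random matrix standing in for the structural-analysis matrices; BSR stores the explicit zeros padded into each block):

    import scipy.sparse as sp

    # dimensions chosen divisible by every block size tried below
    A = sp.random(2400, 2400, density=0.005, format='csr', random_state=0)
    for r, c in [(1, 1), (2, 2), (3, 3), (8, 8)]:
        B = sp.bsr_matrix(A, blocksize=(r, c))
        fill = B.nnz / A.nnz      # BSR nnz counts the padded zeros
        print(f"{r}x{c}: fill ratio = {fill:.2f}")

A blocking is profitable when the raw mflop gain of the unrolled r x c kernel (2.25x above) exceeds the extra work of the fill (1.5x above), leaving a net speedup (1.5x above).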

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                          93

Example: Classical Conjugate Gradient (CG)

(algorithm figure: SpMVs and dot products require communication in each iteration – a sketch with the communication points marked follows below)
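A sketch of classical CG with its per-iteration communication marked (parallel case: each SpMV costs neighbor messages, each dot product a global all-reduce):

    import numpy as np

    def cg(A, b, x, iters):
        r = b - A @ x                  # SpMV: communication
        p = r.copy()
        rr = r @ r                     # dot product: global reduction
        for _ in range(iters):
            Ap = A @ p                 # SpMV: communication
            alpha = rr / (p @ Ap)      # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r             # dot product: global reduction
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x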

94

Example: CA-Conjugate Gradient

(algorithm figure: the SpMVs are computed via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication – see the structural sketch below)
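A structural sketch only, with the coefficient recurrences of CA-CG (Carson/Demmel/Hoemmen) hidden behind hypothetical helpers; it shows where one matrix-powers call and one reduction replace s SpMVs and O(s) dot products:

    def ca_cg(A, b, x, s, outer_iters):
        r = b - A @ x
        p = r.copy()
        for _ in range(outer_iters):
            V = basis(A, p, r, s)       # one CA matrix-powers kernel call
            G = V.T @ V                 # one global reduction (Gram matrix)
            for _ in range(s):
                # inner loop: update short coefficient vectors of x, r, p
                # in the basis V; all "dot products" become small local
                # operations with G -- no communication
                step_coefficients(G)    # hypothetical helper
            x, r, p = recover_vectors(V)  # hypothetical helper
        return x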

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                          96

(figure: convergence of CG vs CA-CG with the monomial basis on a model problem – 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; CA-CG shows slower convergence and loss of accuracy, relative to machine precision, due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down)

                                          97

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                                 Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzeros explicit (O(nnz)):    CSR and variations          Vision, climate, AMR, …
  Nonzeros implicit (o(nnz)):    Graph Laplacian             Stencils

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                          101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

(figures: absolute error for random vectors – same magnitude, opposite signs – and relative error for orthogonal vectors, where even the sign is not reproducible. Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value)

                                          103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.) – sketched below

                                          104
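A minimal sketch of the prerounding idea for a reproducible sum (illustrative only: serial, a single bin; the production algorithms use several bins to retain more accuracy). Round every summand to a common bit boundary chosen from max|x_i|, so every subsequent addition is exact and the result is therefore order-independent, at the cost of roughly log2(n) bits of accuracy:

    import math

    def reproducible_sum(x):
        m = max(abs(v) for v in x)
        if m == 0.0:
            return 0.0
        # boundary so large that n pre-rounded summands add with no rounding
        # error: all partial sums are exact multiples of ulp(shift)
        shift = 2.0 ** (math.ceil(math.log2(m))
                        + math.ceil(math.log2(len(x))) + 1)
        total = 0.0
        for v in x:
            total += (v + shift) - shift  # rounds v to a multiple of ulp(shift)
        return total                       # same bits for any summation order

In C or Fortran the (v + shift) - shift trick would need protection from reassociating compilers; in Python it evaluates as written.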

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                          Summary

Don't Communic…

                                          106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Extends to the rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(#flops / M^(log_mp q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:
  words_moved = Ω(M·(n/M^(1/2))³/P)

Strassen's O(n^lg7) matmul:
  words_moved = Ω(M·(n/M^(1/2))^lg7/P)

Strassen-like O(n^ω) matmul:
  words_moved = Ω(M·(n/M^(1/2))^ω/P)

vs.

(figure: BFS step – runs all 7 multiplies in parallel, each on P/7 processors, needs 7/4 as much memory; DFS step – runs all 7 multiplies sequentially, each on all P processors, needs 1/4 as much memory)

Communication Avoiding Parallel Strassen (CAPS):

    CAPS:
      if EnoughMemory and P ≥ 7
        then BFS step
        else DFS step
      end if

The best way to interleave BFS and DFS steps is a tuning parameter (see the sketch below)

                                            26
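A sketch of that scheduling rule (illustrative names and a crude memory estimate; the real CAPS interleaves BFS and DFS steps as a tuned parameter):

    def caps(n, P, M, level):
        # n: current matrix dimension, P: processors, M: memory per processor
        if level == 0:
            return "local classical multiply"
        if P >= 7 and M >= (7.0 / 4.0) * 3 * n * n / P:   # rough working set
            # BFS step: all 7 Strassen subproblems in parallel on P/7 procs,
            # needs 7/4 as much memory
            return [caps(n // 2, P // 7, M, level - 1) for _ in range(7)]
        # DFS step: the 7 subproblems in sequence on all P procs,
        # needs 1/4 as much memory
        return [caps(n // 2, P, M, level - 1) for _ in range(7)]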

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

                                            Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 needed to deal with numerical stability
  – Strassen already stable, so η=0
• Thm: For sequential versions of these algorithms: Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n² log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA (sketched below):
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory
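A shared-memory sketch of CARMA's splitting rule (a sequential recursion standing in for the BFS/DFS parallel version; the essential point is "always split the largest of m, k, n"):

    import numpy as np

    def carma(A, B, C, threshold=64):
        # computes C += A @ B by recursively splitting the largest dimension
        m, k = A.shape
        n = B.shape[1]
        if max(m, k, n) <= threshold:
            C += A @ B                                      # base case
        elif m >= k and m >= n:                             # split m
            carma(A[:m//2], B, C[:m//2], threshold)
            carma(A[m//2:], B, C[m//2:], threshold)
        elif n >= k:                                        # split n
            carma(A, B[:, :n//2], C[:, :n//2], threshold)
            carma(A, B[:, n//2:], C[:, n//2:], threshold)
        else:                                               # split k: two
            carma(A[:, :k//2], B[:k//2], C, threshold)      # accumulating
            carma(A[:, k//2:], B[k//2:], C, threshold)      # subproblems
        return C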

CARMA Performance: Distributed Memory

(figure, log-log: square case, m = k = n = 6144, on Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA; CARMA vs ScaLAPACK vs peak)

CARMA Performance: Distributed Memory

(figure, log-log: inner product case, m = n = 192, k = 6,291,456, same machine; CARMA vs ScaLAPACK vs peak)

CARMA Performance: Shared Memory

(figure, log-linear: square case, m = k = n, on Intel Emerald, 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA; MKL and CARMA, single and double precision, against single- and double-precision peak)

CARMA Performance: Shared Memory

(figure, log-linear: inner product case, m = n = 64, same machine; MKL and CARMA, single and double precision)

Why is CARMA Faster in Shared Memory? L3 Cache Misses

(figure, linear scale: L3 cache misses for the shared-memory inner product, m = n = 64, k = 524,288; CARMA incurs 97% fewer and 86% fewer misses than the corresponding MKL runs)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach: for i = 1 to n { update column i; update trailing matrix }; #words_moved = O(n³)
• Blocked approach (LAPACK): for i = 1 to n/b { update block i of b columns; update trailing matrix }; #words_moved = O(n³/M^(1/3))
• Recursive approach: func factor(A): if A has 1 column, update it; else { factor(left half of A); update right half of A; factor(right half of A) }; #words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on W = [W0; W1; W2; W3].
Parallel (binary tree): local QRs give R00, R10, R20, R30; pairwise combinations give R01, R11; one more combination gives R02.
Sequential/streaming (flat tree): factor W0 → R00; stack with W1 → R01; with W2 → R02; with W3 → R03.
Dual core: a hybrid of the two trees.]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core. A serial sketch of the binary-tree reduction follows.
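A minimal serial sketch of the parallel (binary-tree) variant in Python/NumPy; the function name and the serial emulation are my own, and a real TSQR also keeps the implicit Q factor at each tree node:

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """R factor of a tall-skinny W via a binary reduction tree (sketch).
    Each leaf does a local QR; each tree node stacks two R factors and
    re-factors them. Only R is returned; Q stays implicit in the tree."""
    Rs = [np.linalg.qr(B, mode='r') for B in np.array_split(W, nblocks)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

# The tree's R matches a direct QR up to row signs:
W = np.random.randn(1024, 8)
assert np.allclose(np.abs(tsqr_R(W)), np.abs(np.linalg.qr(W, mode='r')))
```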

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "tournament pivoting"

W (n × b) = [W1; W2; W3; W4]; factor each block: Wi = Pi·Li·Ui; choose b pivot rows of each Wi, call them Wi′.
Stack the winners pairwise: [W1′; W2′] = P12·L12·U12 and [W3′; W4′] = P34·L34·U34; choose b pivot rows of each, W12′ and W34′.
Final round: [W12′; W34′] = P1234·L1234·U1234; choose b pivot rows.
Go back to W and use these b pivot rows (move them to top, do LU without pivoting). A sketch of the tournament follows.
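A toy Python/SciPy sketch of one tournament (my own illustration: GEPP on each block selects its candidates, and a binary tree reduces them):

```python
import numpy as np
import scipy.linalg

def select_pivot_rows(W, b):
    """The b rows GEPP would pick first as pivots of block W (sketch)."""
    P, L, U = scipy.linalg.lu(W)          # W = P @ L @ U
    perm = np.argmax(P, axis=0)           # row order after pivoting
    return W[perm[:b], :]

def tournament_pivoting(W, b, nblocks=4):
    """Binary-tree tournament over row blocks of W; returns b pivot rows
    to move to the top of W before LU without pivoting (sketch; assumes
    each block has at least b rows)."""
    cands = [select_pivot_rows(Wi, b)
             for Wi in np.array_split(W, nblocks, axis=0)]
    while len(cands) > 1:
        cands = [select_pivot_rows(np.vstack(pair), b)
                 for pair in zip(cands[0::2], cands[1::2])]
    return cands[0]
```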

Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with an LU at each node.
Parallel: local LUs of W1..W4, then pairwise LU combinations up a binary tree.
Sequential/streaming: a flat tree of LUs.
Dual core: a hybrid tree.]

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable

• Details matter
 – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
 – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as partial pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
 – Both random matrices and "special ones"
 – Both binary tree (BCALU) and flat-tree (FCALU)
 – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
 – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment:
 – Generate 100 random 6x6, rank-3 matrices in Matlab
 – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
   • Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
 – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
 – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
 – Same experiment with 20x20, rank-4 matrices: ||L − Lnp|| often O(10³)
• Much harder to break TSLU, but possible
 – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
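The experiment is easy to reproduce; here is my own Python/NumPy transcription of the Matlab experiment described above:

```python
import numpy as np
import scipy.linalg

def lu_nopivot(A):
    """LU without pivoting; returns the unit-lower-triangular L."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                  # division by ~0 gives inf/NaN
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return np.tril(A, -1) + np.eye(n)

rng = np.random.default_rng(0)
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
    P, L, U = scipy.linalg.lu(A)               # GEPP: A = P @ L @ U
    Lnp = lu_nopivot(P.T @ A)                  # redo LU on pre-pivoted A
    print(np.max(np.abs(L - Lnp)))             # a few 0s, infs, NaNs; rest O(1)
```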

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating-point reproducibility

2D CALU with Tournament Pivoting

[Figure: processor layout and communication pattern]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: processor layout and communication pattern]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

A toy model using these numbers follows.
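For concreteness, a back-of-the-envelope use of these parameters (a toy model; the matrix dimension n and 8-byte words are my assumptions, and per-node vs. per-core bandwidth is glossed over):

```python
from math import sqrt

P     = 2**20 * 1024       # nodes * cores/node: ~10^9 cores
alpha = 1e-6               # interconnect latency (s)
beta  = 100e9 / 8          # interconnect bandwidth (words/s, 8-byte words)
n     = 10**7              # hypothetical matrix dimension

words    = n**2 / sqrt(P)  # 2D lower bound on words moved per processor
messages = sqrt(P)         # 2D lower bound on messages per processor
t_comm   = messages * alpha + words / beta
print(f"communication lower bound per processor: {t_comm:.2f} s")
```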

Exascale Predicted Speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: contour plot of predicted speedup over a grid of log₂(p) vs. log₂(n²/p) = log₂(memory_per_proc); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
 – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
   • Save half the flops, preserve inertia
 – Usual approach: Bunch-Kaufman
   • D block diagonal with 1x1 and 2x2 blocks
   • Pivot search down column, along row (lots of communication)
 – Alternative: Aasen
   • D = tridiagonal = T
   • Two steps:
     – P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU [diagram: the banded matrix T]
     – Solve/factor narrow band problem with T
   • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
 – So far, could not do partial pivoting and minimize #messages, just #words
 – Challenge:
   • Column layout good for choosing pivots, bad for matmul
   • Blocked layout good for matmul, bad for choosing pivots
 – Solution: use both layouts, switching between them: "Shape Morphing LU" or SMLU

• func factor(A): if A has 1 column, update it, else { factor(left half of A); update right half of A; factor(right half of A) }
 – #Words = O(n³/M^(1/2))
 – #Messages = O(n³/M)

vs.

• func factor(A): if A has 1 column, update it, else { factor(left half of A); reshape to recursive block format; update right half of A; reshape to columnwise format; factor(right half of A) }
 – #Words = O(n³/M^(1/2))
 – #Messages = O(n³/M^(3/2))

A serial sketch of the shared recursive skeleton follows.

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
 – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR)
 – Usual approach, like partial pivoting:
   • Put longest column first, update rest of matrix, repeat
   • Hard to do using BLAS3 at all, let alone hit the lower bound
 – Use tournament pivoting (see the sketch below)
   • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
   • Thm: this approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
 – Idea extends to other pivoting schemes:
   • Cholesky with diagonal pivoting
   • LU with complete pivoting
   • LDLᵀ with complete pivoting
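A toy sketch of the column tournament for RRQR (my own illustration; each round keeps the b columns chosen by QR with column pivoting):

```python
import numpy as np
import scipy.linalg

def select_cols(W, b):
    """The b columns QR-with-column-pivoting ranks first (a sketch)."""
    _, _, piv = scipy.linalg.qr(W, pivoting=True)
    return W[:, piv[:b]]

def column_tournament(A, b, nblocks=4):
    cands = [select_cols(Ai, b)
             for Ai in np.array_split(A, nblocks, axis=1)]
    while len(cands) > 1:                       # binary reduction tree
        cands = [select_cols(np.hstack(pair), b)
                 for pair in zip(cands[0::2], cands[1::2])]
    return cands[0]                             # b candidate pivot columns
```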

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd–Warshall
• Similar to matmul: let D = A, then:

  for k = 1:n, for i = 1:n, for j = 1:n:
     D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
 – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (sketched in code below):

  D = DC-APSP(A, n):
    D = A; partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12;  D21 = D21 ⊗ D11;  D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21;  D12 = D12 ⊗ D22;  D11 = D12 ⊗ D21
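A serial Python/NumPy transcription of that recursion over the (min,+) semiring (my own sketch; D uses np.inf for absent edges and 0 on the diagonal, and `minplus` builds a cubic-size temporary, which is fine only for small examples):

```python
import numpy as np

def minplus(X, Y, Z):
    """Semiring 'multiply-accumulate': min(Z[i,j], min_k X[i,k] + Y[k,j])."""
    return np.minimum(Z, (X[:, :, None] + Y[None, :, :]).min(axis=1))

def dc_apsp(D):
    """Divide-and-conquer APSP (Kleene's algorithm); serial sketch of the
    2.5D-friendly recursion above."""
    n = D.shape[0]
    if n == 1:
        return D
    h = n // 2
    D11, D12 = D[:h, :h], D[:h, h:]
    D21, D22 = D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D11, D12, D12)
    D21[:] = minplus(D21, D11, D21)
    D22[:] = minplus(D21, D12, D22)
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D22, D21, D21)
    D12[:] = minplus(D12, D22, D12)
    D11[:] = minplus(D12, D21, D11)
    return D

INF = np.inf
D = np.array([[0., 3., INF],
              [INF, 0., 1.],
              [2., INF, 0.]])
print(dc_apsp(D))    # all-pairs shortest path distances
```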

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores), annotated with a 6.2x speedup and a 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
 – Need to do dense Cholesky on a w×w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
 – w = n for a 2D n×n grid, w = n² for a 3D n×n×n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
 – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; need a new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdős–Rényi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdős–Rényi matmul satisfies (in expectation)
   #Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
 – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
 – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
 – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
 – QU's columns are eigenvectors, Λ eigenvalues
 – Dense → Tridiagonal → Diagonal
 – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-avoiding approach
 – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
 – Continue as above, starting with B
 – Dense → Banded → Tridiagonal → Diagonal
 – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
 – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure sequence: successive sweeps of band reduction on a symmetric band of width b+1. Sweep 1 applies Q1 to annihilate c columns and d diagonals; applying Q1ᵀ on the other side creates a (d+c)×(d+c) bulge further down the band; Q2/Q2ᵀ, Q3/Q3ᵀ, Q4/Q4ᵀ, Q5/Q5ᵀ chase the bulges (regions labeled 1–6) down the band until the bandwidth is reduced.]

Conventional vs. CA-SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

[Animations of conventional and communication-avoiding band reduction]

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
 – Best sequential speedup vs MKL: 1.9x
 – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: spectral divide-and-conquer
 – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
 – QᵀAQ will be block upper triangular:

     [ A11  A12 ]
     [  ε   A22 ]

 – Apply recursively to A11, A22
 – Depends on randomization:
   1. Randomized rank-revealing QR decomposition
   2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

Columns: two-levels-of-memory #words / #messages, then memory-hierarchy #words / #messages.

BLAS-3:            [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky:          [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite:   [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR:                [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD:    [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig:      [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

Columns: #words (BW), #messages (L), saving factor.

BLAS-3:            [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky:          [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU:                [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR:                [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD:    [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym. Eig:      [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation
   • Conventional: O(k) moves of data from slow to fast memory
   • New: O(1) moves of data – optimal
 – Parallel implementation on p processors
   • Conventional: O(k log p) messages (k SpMV calls, dot products)
   • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix]

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Figure: zoom of the spy plot, showing 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile in Mflops; the reference implementation vs. the best block size, 4x2]

Register Profile: Itanium 2

[Figure: performance of all register block sizes, ranging from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms, with best fraction of peak in parentheses – Power3 (17%): 122 → 252 Mflops; Power4 (16%): 459 → 820 Mflops; Itanium 1 (8%): 107 → 247 Mflops; Itanium 2 (33%): 190 Mflops → 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: spy plot of the matrix]

Zoom in to top corner

• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: zoomed spy plot]

3x3 blocks look natural, but…

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

[Figure: 3x3 grid overlaid on the sparsity pattern]

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
 – Actual mflop rate 1.5² = 2.25x higher

A sketch of 3x3 block-CSR SpMV follows.
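A sketch of SpMV with a 3x3 block-CSR (BCSR) format like the one described above (my own illustration; a tuned kernel would fully unroll the 3x3 multiply and avoid the temporaries):

```python
import numpy as np

def bcsr_spmv_3x3(vals, col_idx, row_ptr, x):
    """y = A @ x with A in 3x3 block CSR: vals[t] is the t-th dense 3x3
    block (explicit zeros filled in), col_idx[t] its block column, and
    row_ptr[i]:row_ptr[i+1] indexes the blocks of block row i."""
    nbrows = len(row_ptr) - 1
    y = np.zeros(3 * nbrows)
    for i in range(nbrows):
        yi = np.zeros(3)
        for t in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[t]
            yi += vals[t] @ x[3*j : 3*j + 3]   # tuned code unrolls this
        y[3*i : 3*i + 3] = yi
    return y
```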

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the matrix]

100x100 Submatrix Along Diagonal

[Figure]

Post-RCM Reordering

[Figure]

Effect of Combined RCM+TSP Reordering

[Figure: before = green + red; after = green + blue]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski
• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm: standard CG iteration; the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm: s-step CA-CG; the Krylov basis is computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]

A sketch of the (naive) matrix powers kernel follows.
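Since the deck's algorithm listings did not survive extraction, here is a naive Python sketch of the basis that the CA matrix powers kernel computes (my own illustration; the naive loop below communicates every step, whereas the CA kernel gets the same vectors with O(1) rounds of communication by replicating "ghost" rows of A across partitions):

```python
import numpy as np

def matrix_powers(A, x, s):
    """Monomial Krylov basis [x, Ax, ..., A^s x]; works for dense or
    scipy.sparse A. The CA kernel computes the same columns without
    communicating at every step."""
    V = np.empty((x.size, s + 1))
    V[:, 0] = x
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V
```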

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis).
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.
CA-CG shows slower convergence due to roundoff, and loss of accuracy (relative to machine precision) due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                           Nonzeros explicit (O(nnz))   Nonzeros implicit (o(nnz))
 Indices explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
 Indices implicit (o(nnz))   Graph Laplacian              Stencils

• Matrix could be a sum of "sparse" matrices (see the sketch below)
 – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices
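A one-liner illustrating the sparse-plus-low-rank case (my own sketch): apply A = S + U·D·Vᵀ to a vector without ever forming A:

```python
import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_lowrank(S, U, D, V, x):
    """y = (S + U @ D @ V.T) @ x, using only an SpMV and skinny dense ops."""
    return S @ x + U @ (D @ (V.T @ x))

# Example: S sparse n x n; U, V dense n x r; D dense r x r, with r << n
n, r = 1000, 5
S = sp.random(n, n, density=0.01, format='csr')
U, V = np.random.randn(n, r), np.random.randn(n, r)
D = np.random.randn(r, r)
y = apply_sparse_plus_lowrank(S, U, D, V, np.random.randn(n))
```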

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm, at GNS-MBH
 – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
 – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
 – Most: "What?! How will I debug without reproducibility?"
 – Few: "I know better, and do careful error analysis"
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

Intel MKL Non-Reproducibility

[Figure, two panels: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector: dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value.]

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee a fixed reduction tree (fails goal 2 or 3)
 – Use (very) high precision to get the exact answer (fails goal 2)
 – Prerounding technique (Nguyen, D.) – sketched below
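A simplified sketch of the prerounding idea (my own toy version with hypothetical constants; the actual Demmel/Nguyen algorithm handles arbitrary n, keeps several "bins" for accuracy, and vectorizes well):

```python
import math, random

def reproducible_sum(xs):
    """Order-independent summation by pre-rounding: each summand is
    rounded to a multiple of a boundary derived from max|x|, so every
    addition below is exact (for up to a few thousand terms) and any
    summation order gives bit-identical results."""
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    # Keep ~40 leading bits of each summand relative to the maximum.
    boundary = 2.0 ** (math.floor(math.log2(m)) - 40)
    return sum(boundary * round(x / boundary) for x in xs)

xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = xs[:]; random.shuffle(ys)
assert reproducible_sum(xs) == reproducible_sum(ys)   # bitwise identical
```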

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

    CAPS: if enough memory and P ≥ 7
              then BFS step
              else DFS step

The best way to interleave BFS and DFS is a tuning parameter. (A runnable sketch follows below.)
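Here is a minimal sequential Python simulation of the CAPS recursion. The arithmetic is one level of Strassen per call; the BFS/DFS branch only records how a distributed implementation would map the 7 products to processors. The parameters P, enough_mem, and leaf are illustrative, not from the talk.

    import numpy as np

    def caps(A, B, P, enough_mem, leaf=64):
        """Sequential simulation of CAPS for n-by-n matrices, n a power of 2."""
        n = A.shape[0]
        if n <= leaf or P == 1:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        subs = [(A11 + A22, B11 + B22), (A21 + A22, B11), (A11, B12 - B22),
                (A22, B21 - B11), (A11 + A12, B22), (A21 - A11, B11 + B12),
                (A12 - A22, B21 + B22)]
        if enough_mem and P >= 7:
            # BFS step: all 7 products "in parallel", each on P//7 processors
            # (needs ~7/4 as much memory across the machine)
            M = [caps(X, Y, P // 7, enough_mem, leaf) for X, Y in subs]
        else:
            # DFS step: the 7 products one after another, each on all P processors
            # (needs only ~1/4 as much memory)
            M = [caps(X, Y, P, enough_mem, leaf) for X, Y in subs]
        M1, M2, M3, M4, M5, M6, M7 = M
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])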

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as a Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen itself is already stable, so η = 0
• Thm: for sequential versions of these algorithms, Words_moved = O(n^(ω+η) / M^((ω+η)/2 – 1) + n² log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors and available memory
• CARMA (a sketch follows below)
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors and available memory
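A minimal sequential sketch of CARMA's split-largest-dimension recursion; the leaf cutoff is a hypothetical stand-in for the point where a single BLAS call takes over.

    import numpy as np

    def carma(A, B, leaf=64):
        """Recursive matmul in the spirit of CARMA: split the largest dimension."""
        m, k = A.shape
        _, n = B.shape
        if max(m, k, n) <= leaf:
            return A @ B                   # base case: one BLAS call
        if m >= k and m >= n:              # m largest: split rows of A
            h = m // 2
            return np.vstack([carma(A[:h, :], B, leaf), carma(A[h:, :], B, leaf)])
        if n >= k:                         # n largest: split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], leaf), carma(A, B[:, h:], leaf)])
        h = k // 2                         # k largest: split shared dim; halves sum
        return carma(A[:, :h], B[:h, :], leaf) + carma(A[:, h:], B[h:, :], leaf)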

CARMA Performance: Distributed Memory

Square: m = k = n = 6144
[Log-log strong-scaling plot comparing ScaLAPACK, CARMA, and machine peak.]
Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA

CARMA Performance: Distributed Memory

Inner Product: m = n = 192, k = 6,291,456
[Log-log strong-scaling plot comparing ScaLAPACK, CARMA, and machine peak.]
Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA

CARMA Performance: Shared Memory

Square: m = k = n
[Plot (log size axis, linear performance axis) comparing MKL and CARMA in single and double precision against single- and double-precision peak.]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

Inner Product: m = n = 64
[Plot (log size axis, linear performance axis) comparing MKL and CARMA in single and double precision.]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

Shared-memory inner product (m = n = 64, k = 524,288):
[Bar chart (linear scale): CARMA incurs 97% fewer misses and 86% fewer misses than MKL.]

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
      for i = 1 to n
          update column i
          update trailing matrix
  words_moved = O(n³)
• Blocked approach (LAPACK):
      for i = 1 to n/b
          update block i of b columns
          update trailing matrix
  words_moved = O(n³ / M^(1/3))
• Recursive approach (sketch below):
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              update right half of A
              factor(right half of A)
  words_moved = O(n³ / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea
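For concreteness, a numpy sketch of the recursive approach above, without pivoting (a real code adds pivoting, as discussed next); the in-place layout is illustrative.

    import numpy as np

    def rlu(A):
        """Recursive LU without pivoting, in place: unit-lower L below the
        diagonal, U on and above. A must be a float array."""
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]            # one column: scale below the diagonal
            return A
        h = n // 2
        rlu(A[:, :h])                      # factor left half (tall and skinny)
        L11 = np.tril(A[:h, :h], -1) + np.eye(h)
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])   # U12 = L11^{-1} A12
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]            # Schur complement update
        rlu(A[h:, h:])                     # factor the updated trailing block
        return A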

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3]

[Reduction-tree figures:]
• Parallel: each Wi is QR-factored to Ri0 in parallel; pairs combine, R00 and R10 into R01, R20 and R30 into R11; finally R01 and R11 combine into R02
• Sequential/streaming: W0 gives R00; stack R00 on W1 to get R01; then R02; then R03
• Dual core: a hybrid of the two trees

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core. (A sketch of the parallel tree follows below.)
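A minimal numpy sketch of the parallel (binary-tree) TSQR reduction, computing only the R factor and leaving Q implicit; the block count P is illustrative.

    import numpy as np

    def tsqr_r(W, P=4):
        """R factor of tall-skinny W via a binary reduction tree.
        Local QRs stand in for per-processor work."""
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, P, axis=0)]
        while len(Rs) > 1:                 # combine R factors pairwise up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    # Sanity check: R matches ordinary Householder QR up to row signs
    W = np.random.default_rng(0).standard_normal((1024, 8))
    assert np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode='r')))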

Back to LU: using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

    W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
    Choose b pivot rows of each Wi, call them Wi'

    [W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
    [W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'

    [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting). (A sketch follows below.)
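A rough Python sketch of tournament pivoting on a tall panel, using scipy's GEPP as the local "choose b pivot rows" step; the helper names are hypothetical.

    import numpy as np
    from scipy.linalg import lu

    def best_rows(A, rows, b):
        """GEPP on the candidate rows; return the b rows it promotes to the top."""
        P, L, U = lu(A[rows, :])           # A[rows] = P @ L @ U
        order = np.argmax(P, axis=0)       # row i of L@U came from rows[order[i]]
        return rows[order[:b]]

    def tournament_pivots(A, b, nblocks):
        """Tournament pivoting for a tall n-by-b panel A: each block proposes
        b pivot rows, and winners play off pairwise up the reduction tree."""
        groups = np.array_split(np.arange(A.shape[0]), nblocks)
        cands = [best_rows(A, g, b) for g in groups]
        while len(cands) > 1:
            cands = [best_rows(A, np.concatenate(cands[i:i + 2]), b)
                     for i in range(0, len(cands), 2)]
        return cands[0]   # move these b rows to the top, then LU without pivoting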

Minimizing Communication in TSLU

W = [W1; W2; W3; W4]

[Same reduction trees as for TSQR, with a local LU at each node:]
• Parallel: binary tree of local LUs
• Sequential/streaming: flat tree of local LUs
• Dual core: hybrid tree

Can choose the reduction tree dynamically, to match the architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as partial pivoting (GEPP) in the following sense: it produces the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA–LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

                                              Why is stability of TSLU just a ldquoThmrdquo

                                              bull Proof is correct ndash in exact arithmeticbull Experiment

                                              ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

                                              they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

                                              ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

                                              ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

                                              ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

                                              bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

                                              panel in symmetric-indefinite factorization 41
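A numpy approximation of the Matlab experiment above, assuming a straightforward Doolittle no-pivot LU as the comparison; the seed and trial count are illustrative.

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot(A):
        """Doolittle LU without pivoting; may divide by (near-)zero pivots."""
        A = A.copy()
        n = A.shape[0]
        L, U = np.eye(n), np.zeros_like(A)
        for k in range(n):
            U[k, k:] = A[k, k:]
            L[k+1:, k] = A[k+1:, k] / U[k, k]
            A[k+1:, k+1:] -= np.outer(L[k+1:, k], U[k, k+1:])
        return L, U

    rng = np.random.default_rng(0)
    np.seterr(all="ignore")               # let 0/0 produce inf/nan, as on the slide
    for trial in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, U = lu(A)                   # GEPP: A = P @ L @ U
        Lnp, _ = lu_nopivot(P.T @ A)      # no-pivot LU of the pre-permuted matrix
        print(trial, np.linalg.norm(L - Lnp))  # exact arithmetic: 0; floats: 0/inf/nan/O(1)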

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
  • Compute ||L||; if not big (usual case), proceed, else
  • Factor A = QR using TSQR, then
  • Factor Q = PLU using TSLU, then
  • A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating-point reproducibility

2D CALU with Tournament Pivoting

2.5D CALU with Tournament Pivoting (c=4 copies)

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); speedups up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Saves 1/2 the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down the column and along the row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ, where T is banded, using TSLU

        [Figure: banded matrix T, nonzeros near the diagonal]

      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Usual recursion:
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              update right half of A
              factor(right half of A)
  Words = O(n³ / M^(1/2)); Messages = O(n³ / M)

• SMLU:
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              reshape to recursive block format
              update right half of A
              reshape to columnwise format
              factor(right half of A)
  Words = O(n³ / M^(1/2)); Messages = O(n³ / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, e.g. in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

      for k = 1:n
          for i = 1:n
              for j = 1:n
                  D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows below):

      D = DC-APSP(A, n)
          D = A
          Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
          D11 = DC-APSP(D11, n/2)
          D12 = D11 ⊗ D12
          D21 = D21 ⊗ D11
          D22 = D21 ⊗ D12
          D22 = DC-APSP(D22, n/2)
          D21 = D22 ⊗ D21
          D12 = D12 ⊗ D22
          D11 = D12 ⊗ D21
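A runnable numpy sketch of the recursion, with the min-plus product written out and the semiring "addition" (min-accumulation) made explicit where the slide's assignments implicitly keep the old values.

    import numpy as np

    def minplus(A, B):
        """Min-plus 'matmul': C(i,j) = min_k A(i,k) + B(k,j)."""
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    def dc_apsp(D):
        """Divide-and-conquer APSP (Kleene). D holds edge weights, np.inf where
        there is no edge, 0 on the diagonal."""
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D = D.copy()
        D[:m, :m] = dc_apsp(D[:m, :m])                                    # close D11
        D[:m, m:] = minplus(D[:m, :m], D[:m, m:])                         # D12 = D11⊗D12
        D[m:, :m] = minplus(D[m:, :m], D[:m, :m])                         # D21 = D21⊗D11
        D[m:, m:] = np.minimum(D[m:, m:], minplus(D[m:, :m], D[:m, m:]))  # D22 ⊕= D21⊗D12
        D[m:, m:] = dc_apsp(D[m:, m:])                                    # close D22
        D[m:, :m] = minplus(D[m:, m:], D[m:, :m])                         # D21 = D22⊗D21
        D[:m, m:] = minplus(D[:m, m:], D[m:, m:])                         # D12 = D12⊗D22
        D[:m, :m] = np.minimum(D[:m, :m], minplus(D[:m, m:], D[m:, :m]))  # D11 ⊕= D12⊗D21
        return D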

Performance of 2.5D APSP using Kleene

[Strong-scaling plot on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³ / M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains these bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω(min(d·n / P^(1/2), d²·n / P))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d²·n / (P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – The columns of Q·U are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A → QAQᵀ = B, where B = Bᵀ is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures animating the band reduction: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b. Orthogonal transforms Q1, Q1ᵀ, Q2, Q2ᵀ, …, Q5, Q5ᵀ eliminate c columns at a time (numbered regions 1–6); each elimination creates a (d+c) x (d+c) bulge that is chased down the band.]

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(First pair of entries: two-level memory, #words / #messages; second pair: hierarchy of memories, #words / #messages.)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.], for all four
• Cholesky: [G'97][AP'00] / [LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] / [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13] / [G'97][T'97][BDLST'13] / [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] / [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
• Non-Sym Eig: [BDD'11] / [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = n²/P
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3: words [AGZ'94][MT'99][ScaLAPACK]; messages [C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym Indefinite: words [BBDDDPSTY'13][ScaLAPACK]; messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: words [ScaLAPACK][GDX'11][T'99][SD'11]; messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: words [ScaLAPACK][DGHL'12][T'99]; messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: words [BDD'11][BDK'13][ScaLAPACK]; messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym Eig: words [BDD'11]; messages [BDD'11]; saving factors BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = c·n²/P

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

(A sketch of the conventional kernel follows below.)
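For contrast, here is the conventional kernel that the CA approach restructures, sketched in numpy/scipy; the CA "matrix powers kernel" computes the same k+1 basis vectors with one round of communication by replicating a k-deep halo of each processor's rows, at the price of redundant flops.

    import numpy as np
    import scipy.sparse as sp

    def powers_naive(A, x, k):
        """[x, Ax, ..., A^k x] with k separate SpMVs: k rounds of communication
        in parallel, k sweeps over A in serial."""
        V = np.empty((k + 1, x.size))
        V[0] = x
        for j in range(k):
            V[j + 1] = A @ V[j]
        return V

    # Example: 1D Poisson (tridiagonal) stencil
    n, k = 16, 4
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    V = powers_naive(A, np.ones(n), k)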

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Spy plot of the matrix.]

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Zoomed spy plot showing the 8x8 dense blocks.]

Speedups on Itanium 2: The Need for Search

[Register-blocking profile: reference implementation vs the best blocking (4x2), in Mflops.]

Register Profile: Itanium 2

[Heat map over register block sizes: 190 Mflops (worst) to 1190 Mflops (best).]

Register Profiles: IBM and Intel IA-64

[Four heat maps over register block sizes: Power3 (17%; 122–252 Mflops), Power4 (16%; 459–820 Mflops), Itanium 1 (8%; 107–247 Mflops), Itanium 2 (33%; 190 Mflops–1.2 Gflops).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

[Spy plot of the matrix.]

Zoom in to top corner

[Zoomed spy plot showing the irregular block structure.]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But it would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Spy plot; 100x100 submatrix along the diagonal.]

Post-RCM Reordering

[Spy plot after reverse Cuthill-McKee reordering.]

Effect of Combined RCM+TSP Reordering

[Before: green + red; after: green + blue.]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver-library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                              93

Example: Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in each iteration.

Example: CA-Conjugate Gradient
The s SpMVs are done via the CA matrix powers kernel, and the dot products are replaced by one global reduction to compute the Gram matrix G; local computations within the inner loop require no communication.
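To make the communication pattern concrete, here is a minimal NumPy sketch of classical CG with its per-iteration communication points marked in comments, plus the CA-CG loop structure; the dense operator stand-in and names are illustrative assumptions, not the implementation behind these slides:

    import numpy as np

    def cg(A, b, x0, tol=1e-8, maxiter=200):
        """Classical CG: each iteration does 1 SpMV (neighbor communication)
        and 2 dot products (global reductions)."""
        x = x0.copy()
        r = b - A @ x                    # SpMV: communication
        p = r.copy()
        rr = r @ r                       # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p                   # SpMV: communication
            alpha = rr / (p @ Ap)        # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r               # dot product: global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # CA-CG restructures this so each OUTER iteration does:
    #   1. matrix powers kernel: build the basis [p, Ap, ..., A^s p, ...]
    #      with one round of neighbor communication for all s SpMVs
    #   2. one global reduction to form the Gram matrix G of the basis
    #   3. s inner iterations updating short coefficient vectors locally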

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


[Convergence plot: CA-CG (monomial basis) vs. CG; both curves head toward machine precision, with CA-CG converging more slowly and less accurately]
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400
• Slower convergence and loss of accuracy due to roundoff
• At s = 16, the monomial basis is rank deficient; the method breaks down
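The breakdown is easy to demonstrate: the columns of the monomial basis [v, Av, …, A^s v] align with the dominant eigenvector, so the basis's condition number explodes with s. A small NumPy check on the slide's model problem (the starting vector and seed are illustrative assumptions):

    import numpy as np

    # 2D Poisson on a 30x30 grid (5-point stencil) via Kronecker sums.
    n = 30
    T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

    rng = np.random.default_rng(0)
    v = rng.standard_normal(n*n)
    V = [v]
    for _ in range(16):
        V.append(A @ V[-1])          # monomial basis [v, Av, ..., A^16 v]
    K = np.column_stack(V)
    print(np.linalg.cond(K))         # ~1e16 or worse: numerically rank deficient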


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:     CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit:     Graph Laplacian             Stencils
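The S + U·D·V^T case is worth making concrete: the point is to apply A, and its powers, without ever forming a dense n x n matrix. A NumPy/SciPy sketch (sizes, density, and seed are illustrative assumptions):

    import numpy as np
    import scipy.sparse as sp

    n, r = 1000, 5
    rng = np.random.default_rng(0)
    S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))        # small & square, as on the slide

    def apply_A(x):
        # A = S + U D V^T as an operator: O(nnz + n*r) work per apply
        return S @ x + U @ (D @ (V.T @ x))

    x = rng.standard_normal(n)
    y = apply_A(apply_A(x))                    # A^2 x by repeated application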

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]
• Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2. and 3.)
  – Use (very) high precision to get exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.)
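Here is a deliberately simplified, single-bin sketch of the prerounding idea: round every summand to a common power-of-two boundary so that all subsequent additions are exact, hence order-independent. The boundary choice below is an assumption for illustration; the actual Nguyen/Demmel algorithm uses several bins to preserve accuracy under cancellation.

    import math
    import numpy as np

    def reproducible_sum(x):
        """Order-independent sum via 1-level pre-rounding (simplified sketch)."""
        x = np.asarray(x, dtype=np.float64)
        n = x.size
        m = np.max(np.abs(x))
        if m == 0.0:
            return 0.0
        # Power-of-two boundary M >= 2*n*m: rounded summands become multiples
        # of ulp(M), and every partial sum stays exactly representable.
        M = 2.0 ** (math.ceil(math.log2(m)) + math.ceil(math.log2(n)) + 1)
        rounded = (M + x) - M        # fl((M + xi) - M) rounds xi to the boundary
        return float(np.sum(rounded))  # every addition exact => any order works

    # Same answer no matter how the summands are ordered:
    rng = np.random.default_rng(1)
    v = rng.standard_normal(10**6)
    assert reproducible_sum(v) == reproducible_sum(v[::-1])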

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                              Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)



Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24% - 184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

                                                Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0

• Thm: For sequential versions of these algorithms, #Words_moved = O(n^(ω+η)/M^((ω+η)/2 – 1) + n² log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

                                                Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors and available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems (see the sketch below)
  – Choose BFS or DFS to adapt to #processors and available memory
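A sequential sketch of the CARMA recursion in NumPy; the threshold base case stands in for a tuned kernel, and the BFS/DFS choice only matters in the parallel setting:

    import numpy as np

    def carma(A, B, threshold=64):
        """Recursive matmul that always splits the largest of (m, k, n);
        this shape-adaptivity is what makes CARMA communication-optimal
        for any matrix shape, including tall-skinny."""
        m, k = A.shape
        _, n = B.shape
        if max(m, k, n) <= threshold:
            return A @ B                              # base case: tuned kernel
        if m >= k and m >= n:                         # split rows of A
            h = m // 2
            return np.vstack([carma(A[:h], B, threshold),
                              carma(A[h:], B, threshold)])
        if n >= k:                                    # split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], threshold),
                              carma(A, B[:, h:], threshold)])
        h = k // 2                                    # split shared dimension
        return (carma(A[:, :h], B[:h], threshold) +
                carma(A[:, h:], B[h:], threshold))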

CARMA Performance: Distributed Memory
Square: m = k = n = 6144
[Strong-scaling plot, log-log axes: CARMA vs. ScaLAPACK vs. peak]
Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA

CARMA Performance: Distributed Memory
Inner Product: m = n = 192, k = 6,291,456
[Strong-scaling plot, log-log axes: CARMA vs. ScaLAPACK vs. peak]
Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA

CARMA Performance: Shared Memory
Square: m = k = n
[Plot, log-linear axes: MKL and CARMA in single and double precision, vs. single/double peak]
Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory
Inner Product: m = n = 64
[Plot, log-linear axes: MKL and CARMA in single and double precision]
Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524,288)
[Bar chart, linear axis: CARMA incurs 97% fewer misses / 86% fewer misses than MKL]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n³)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n³/M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree):
  W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → pairwise combines give R01, R11 → final combine gives R02

Sequential/Streaming (flat tree):
  W = [W0; W1; W2; W3] → R00 → fold in W1 → R01 → fold in W2 → R02 → fold in W3 → R03

Dual Core (hybrid):
  W = [W0; W1; W2; W3] → mix of local QRs and pairwise combines across the two cores → R03

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core (a small simulation follows)
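A sequential simulation of parallel TSQR with a binary reduction tree (NumPy; the block count and sizes are illustrative assumptions):

    import numpy as np

    def tsqr_R(blocks):
        """R factor of vstack(blocks) via a binary reduction tree."""
        Rs = [np.linalg.qr(W, mode="r") for W in blocks]   # local QRs, no communication
        while len(Rs) > 1:                                 # log2(P) reduction steps
            # each vstack+QR is one message pair in the parallel version
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode="r")
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.randn(4000, 50)
    blocks = np.array_split(W, 4)           # 4 "processors"
    R = tsqr_R(blocks)
    R_ref = np.linalg.qr(W, mode="r")       # agrees with R up to row signs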

Back to LU: using the same idea for TSLU as TSQR, use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
• Locally factor each block: Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi'
• Stack pairs and factor: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, giving W12' and W34'
• Final round: [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
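A sequential sketch of one tournament (NumPy/SciPy); for clarity it passes candidate rows up the tree rather than tracking global row indices, as a real TSLU must:

    import numpy as np
    from scipy.linalg import lu

    def best_rows(W, b):
        # GEPP on the block; its first b pivot rows are the "winners".
        P, L, U = lu(W)               # W = P @ L @ U
        return (P.T @ W)[:b]          # rows of W in pivot order, keep top b

    def tournament_pivot_rows(W, b, nblocks=4):
        cand = [best_rows(Wi, b)      # local round: no communication
                for Wi in np.array_split(W, nblocks)]
        while len(cand) > 1:          # log2(nblocks) pairwise rounds up the tree
            cand = [best_rows(np.vstack(cand[i:i+2]), b)
                    for i in range(0, len(cand), 2)]
        return cand[0]                # b pivot rows for all of W

    W = np.random.randn(1024, 8)
    pivots = tournament_pivot_rows(W, b=8)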


                                                Minimizing Communication in TSLU

Parallel (binary tree): W = [W1; W2; W3; W4] → local LUs → pairwise LUs → final LU
Sequential/Streaming (flat tree): LU of W1, then fold in W2, W3, W4 one at a time
Dual Core (hybrid): mix of local LUs and pairwise combines across the two cores

Can choose reduction tree dynamically to match architecture, as before

                                                Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?


Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank 3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
• Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank 6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank 4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
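The experiment is easy to reproduce outside Matlab. A NumPy/SciPy version under the same setup (the seed and the textbook no-pivot elimination are illustrative assumptions):

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot(A):
        """Textbook LU without pivoting; blows up on (near-)zero pivots."""
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            L[k+1:, k] = A[k+1:, k] / A[k, k]          # pivot may be ~1e-16
            A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
        return L

    rng = np.random.default_rng(0)
    diffs = []
    with np.errstate(all="ignore"):
        for _ in range(100):
            A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
            P, L, U = lu(A)                  # GEPP: A = P @ L @ U
            Lnp = lu_nopivot(P.T @ A)        # same row order, no pivoting
            diffs.append(np.linalg.norm(L - Lnp))
    diffs = np.array(diffs)
    print(np.sum(~np.isfinite(diffs)), "blow-ups (inf/NaN); median of the rest:",
          np.median(diffs[np.isfinite(diffs)]))   # typically O(1)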

                                                Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)

• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

                                                2D CALU with Tournament Pivoting


2.5D CALU with Tournament Pivoting (c=4 copies)


Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc)]
Up to 29x

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU (pseudocode comparison below)

• Recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M)

• Shape Morphing LU (SMLU):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable transcription follows):

    D = DC-APSP(A, n)
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
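A direct NumPy transcription of Kleene's algorithm over the (min,+) semiring; it assumes D(i,i) = 0 and missing edges set to ∞, and the ⊗ above, including its accumulation into D, is the helper mm:

    import numpy as np

    def mm(D, A, B):
        # (min,+) "matmul" with accumulation:
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12 = D[:h, :h].copy(), D[:h, h:].copy()
        D21, D22 = D[h:, :h].copy(), D[h:, h:].copy()
        D11 = dc_apsp(D11)
        D12 = mm(D12, D11, D12)
        D21 = mm(D21, D21, D11)
        D22 = mm(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = mm(D21, D22, D21)
        D12 = mm(D12, D12, D22)
        D11 = mm(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    INF = np.inf
    A = np.array([[0, 3, INF],
                  [INF, 0, 1],
                  [2, INF, 0]], dtype=float)
    print(dc_apsp(A))   # [[0 3 4] [3 0 1] [2 5 0]]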

Performance of 2.5D APSP using Kleene

Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot: strong scaling, showing a 6.2x speedup and a 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors; Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
[Animation frames: sweeps 1 through 6 chase the bulge down the band, applying orthogonal updates Q1·…·Q1^T through Q5·…·Q5^T; each sweep annihilates c columns / d diagonals at a time]

Conventional vs. CA-SBR

  Conventional:              touch all data 4 times
  Communication-Avoiding:    touch all data once

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

                                                Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

      Q^T·A·Q = [ A11  A12 ]
                [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]; columns are #words / #messages, for a two-level memory and for a full memory hierarchy.

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] (all columns)
• Cholesky: two-level [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09]; hierarchy [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] (words and messages)
• LU: two-level [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13]; hierarchy [G'97][T'97][BDLST'13] / [BDLST'13]
• QR: two-level [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13]; hierarchy [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
• Non-Sym. Eig: [BDD'11] / [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(ignoring polylog(P) factors; lower bounds: #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saves latency by factor n/P^(1/2)
• Cholesky: [ScaLAPACK] [T'99] [SD'11]; saves latency by n/P^(1/2)
• Sym. Indefinite: words [BBDDDPSTY'13] [ScaLAPACK], messages [BBDDDPSTY'13]; saves latency by n/P^(1/2)
• LU: words [ScaLAPACK] [GDX'11] [T'99] [SD'11], messages [GDX'11] [T'99] [SD'11]; saves latency by n/P^(1/2)
• QR: words [ScaLAPACK] [DGHL'12] [T'99], messages [DGHL'12] [T'99]; saves latency by n/P^(1/2)
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD: words [BDD'11] [BDK'13] [ScaLAPACK], messages [BDD'11] [BDK'13]; saves latency by n/P^(1/2)
• Nonsym. Eig: [BDD'11]; saves bandwidth by P^(1/2) and latency by n

Attaining with extra memory: 2.5D algorithms, M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data - optimal
 – Parallel implementation on p processors: conventional O(k log p) messages (k SpMV calls, dot products); new O(log p) messages - optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability
(a toy serial emulation of the matrix powers kernel follows)
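To make the "O(1) data moves / O(log p) messages" claim concrete, here is a toy serial emulation of the matrix powers kernel (a sketch, not code from the talk; function names are made up), for the 1D Poisson operator A = tridiag(-1, 2, -1): a "processor" owning n contiguous entries of x receives k ghost values from each neighbor once, then computes its local pieces of Ax, …, A^k x with no further communication, paying some redundant flops in the ghost regions.

    import numpy as np

    def apply_1d_poisson(v):
        # One application of A = tridiag(-1, 2, -1) to the interior of v;
        # the result is 2 entries shorter (one lost at each end).
        return 2.0 * v[1:-1] - v[:-2] - v[2:]

    def local_matrix_powers(x_local, left_ghosts, right_ghosts, k):
        # Return [A^1 x, ..., A^k x] restricted to the local entries, given
        # k ghost values from each neighbor (one message per neighbor).
        n = len(x_local)
        work = np.concatenate([left_ghosts, x_local, right_ghosts])  # n + 2k
        powers = []
        for j in range(1, k + 1):
            work = apply_1d_poisson(work)    # length shrinks by 2 per step
            lo = k - j                       # local part sits k-j entries in
            powers.append(work[lo:lo + n].copy())
        return powers

    # Check the emulation against a global computation (interior processor).
    N, k = 30, 4
    x = np.random.default_rng(0).standard_normal(N)
    A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    loc = local_matrix_powers(x[10:20], x[10 - k:10], x[20:20 + k], k)
    for j in range(1, k + 1):
        assert np.allclose(loc[j - 1], (np.linalg.matrix_power(A, j) @ x)[10:20])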



Example: The Difficulty of Tuning SpMV

• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
[Figure: spy plot of the raefsky matrix]

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heatmap on Itanium 2; the reference implementation runs at 190 Mflops while the best block size (4x2) reaches 1190 Mflops, and performance varies irregularly with block size, so the best one must be found by search]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles; observed rates: Power3 122 to 252 Mflops, Power4 459 to 820 Mflops, Itanium 1 107 to 247 Mflops, Itanium 2 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• n = 16,614; nnz = 1.1 M
[Figure: spy plot of Ex11, with a zoom in to the top corner]

3x3 blocks look natural, but…

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
[Figure: zoomed nonzero pattern overlaid with a logical 3x3 grid]

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup! Actual Mflop rate is 1.5^2 = 2.25x higher
(a small scipy fill-ratio demo follows)
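The fill-ratio arithmetic is easy to see in scipy (a sketch, not the talk's benchmark): block storage adds explicit zeros, the fill ratio is stored values over true nonzeros, and whether the extra flops pay off is machine dependent, which is why block size is chosen by search.

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)

    # Build a matrix whose nonzeros cluster in 3x3 tiles, plus stragglers,
    # so that 3x3 blocking incurs modest fill-in.
    n_tiles = 200
    tiles = sparse.random(n_tiles, n_tiles, density=0.01, random_state=rng)
    A = sparse.kron(tiles, np.ones((3, 3))).tocsr()          # perfect 3x3 blocks
    A = (A + sparse.random(3 * n_tiles, 3 * n_tiles, density=0.001,
                           random_state=rng)).tocsr()        # stragglers => fill
    A.eliminate_zeros()

    B = sparse.bsr_matrix(A, blocksize=(3, 3))  # fills in explicit zeros

    # .nnz counts stored values, including the explicit zeros in B's blocks.
    fill_ratio = B.nnz / A.nnz
    print(f"fill ratio = {fill_ratio:.2f}")

    # Both formats compute the same y = A @ x; the BSR version does
    # fill_ratio times as many flops, but with unrolled dense 3x3 kernels.
    x = rng.standard_normal(A.shape[1])
    assert np.allclose(A @ x, B @ x)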


Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the cavity matrix]

100x100 Submatrix Along Diagonal
[Figure: diagonal submatrix before reordering]

Post-RCM Reordering
[Figure: same submatrix after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering
• Before: green + red. After: green + blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski
• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)

[Algorithm listing: the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm listing: the s SpMVs are obtained via the CA matrix powers kernel, the dot products via one global reduction to compute the Gram matrix G; local computations within the inner loop require no communication]

(a plain CG sketch with its communication points marked follows)
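For reference, a textbook CG loop (a sketch, not the slide's listing) with its communication points marked in comments: CA-CG unrolls this loop s times, obtains the s SpMVs from the matrix powers kernel, and batches all the inner products into one global reduction forming a Gram matrix per outer iteration.

    import numpy as np

    def cg(A, b, tol=1e-10, maxiter=500):
        # Classical conjugate gradients for SPD A (dense here for simplicity).
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rho = r @ r                   # dot product: one global reduction
        for _ in range(maxiter):
            w = A @ p                 # SpMV: neighbor communication in parallel
            alpha = rho / (p @ w)     # dot product: one global reduction
            x += alpha * p
            r -= alpha * w
            rho_new = r @ r           # dot product: one global reduction
            if np.sqrt(rho_new) < tol:
                break
            p = r + (rho_new / rho) * p   # local axpy: no communication
            rho = rho_new
        return x

    # 1D Poisson test problem.
    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))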


[Figure: convergence of CG vs CA-CG with the monomial basis, on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG converges more slowly due to roundoff and loses accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]

(a short demonstration of the rank deficiency follows)
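The breakdown is easy to reproduce in a few lines (a sketch using the slide's model problem; the exact numerical rank can vary slightly with the random starting vector): columns of the monomial basis [v, Av, A^2·v, …] grow like ||A||^j and align with the dominant eigenvector, so by s = 16 the basis is numerically rank deficient in double precision.

    import numpy as np
    from scipy import sparse

    # 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400).
    m = 30
    T = sparse.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    I = sparse.identity(m)
    A = (sparse.kron(I, T) + sparse.kron(T, I)).tocsr()

    rng = np.random.default_rng(0)
    v = rng.standard_normal(m * m)

    s = 16
    V = [v]
    for _ in range(s):
        V.append(A @ V[-1])           # monomial basis: v, Av, A^2 v, ...
    K = np.column_stack(V)            # 900 x 17

    sv = np.linalg.svd(K, compute_uv=False)
    print(f"condition number of basis: {sv[0] / sv[-1]:.2e}")
    print(f"numerical rank: {np.linalg.matrix_rank(K)} of {K.shape[1]}")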


What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
    Entries explicit (O(nnz))  CSR and variations          Vision, climate, AMR, …
    Entries implicit (o(nnz))  Graph Laplacian             Stencils

(a sketch of applying A = S + U·D·V^T without forming it follows)
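A small sketch of why the representation matters (an illustration with made-up S, U, D, V, not from the talk): A = S + U·D·V^T can be applied to a vector, and hence Krylov vectors A^2·x, A^3·x, … can be built, without ever densifying A, keeping the low-rank term as three skinny products.

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    n, r = 1000, 5

    S = sparse.random(n, n, density=0.005, random_state=rng).tocsr()  # sparse
    U = rng.standard_normal((n, r))                                   # low rank
    D = rng.standard_normal((r, r))                                   # r x r, small
    V = rng.standard_normal((n, r))

    def apply_A(x):
        # A @ x with A = S + U D V^T, without forming A: O(nnz(S) + n*r) work.
        return S @ x + U @ (D @ (V.T @ x))

    # Krylov vectors x, Ax, A^2 x, ... stay cheap to form.
    x = rng.standard_normal(n)
    krylov = [x]
    for _ in range(4):
        krylov.append(apply_A(krylov[-1]))

    A_dense = S.toarray() + U @ D @ V.T   # only for checking the sketch
    assert np.allclose(krylov[3], np.linalg.matrix_power(A_dense, 3) @ x)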


Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm at GNS-MBH
 – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
 – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
 – Most: "What? How will I debug without reproducibility?"
 – Few: "I know better, and do careful error analysis"
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign is not reproducible]

Vector size: 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee fixed reduction tree (sacrifices goals 2 and 3)
 – Use (very) high precision to get exact answer (sacrifices goal 2)
 – Prerounding technique (Nguyen, D.)
(a minimal prerounding demo follows)
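A minimal one-sweep version of the prerounding idea (a simplification of the Nguyen/D. scheme, which uses several extraction bins to retain accuracy): round every summand to a multiple of a common power-of-two unit chosen large enough that the subsequent additions commit no rounding error at all; the sum is then bitwise identical in any order.

    import math
    import random

    def preround_sum(xs):
        # Order-independent sum: one extraction sweep of the prerounding idea.
        # Accuracy is limited (error up to ~n*ulp(S)); the real scheme uses
        # several bins to recover near-full accuracy.
        m = max((abs(x) for x in xs), default=0.0)
        if m == 0.0:
            return 0.0
        # S is a power of two with S >= 2*n*max|x|, so every extracted part
        # is a multiple of ulp(S)/2 and the parts add with no rounding error.
        S = 2.0 ** math.ceil(math.log2(2 * len(xs) * m))
        parts = [(S + x) - S for x in xs]   # rounds x toward a multiple of ulp(S)
        return sum(parts)                   # exact, hence order-independent

    random.seed(0)
    xs = [random.uniform(-1, 1) for _ in range(100000)]

    plain, repro = set(), set()
    for _ in range(5):
        random.shuffle(xs)              # emulate nondeterministic reduction order
        plain.add(sum(xs))
        repro.add(preround_sum(xs))

    print(f"ordinary sum: {len(plain)} distinct results")    # typically > 1
    print(f"prerounded sum: {len(repro)} distinct results")  # always 1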


Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Strassen-like: beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
 – η > 0 needed to deal with numerical stability
 – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms: Words_moved = O(n^(ω+η)/M^((ω+η)/2 - 1) + n^2 log n), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
 – Not always: 2.5D matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
 – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
 – Choose BFS or DFS to adapt to #processors, available memory
(a serial sketch of the splitting rule follows)
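A sequential sketch of CARMA's splitting rule (an illustration; real CARMA additionally chooses BFS or DFS at each split to trade memory for parallelism, which a serial recursion cannot show): always halve the largest of the three dimensions m, k, n, recursing until the problem is small.

    import numpy as np

    def carma(A, B, small=64):
        # Recursive classical matmul: split the largest of m, k, n in half.
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        if max(m, k, n) <= small:           # base case: "fits in fast memory"
            return A @ B
        if m >= k and m >= n:               # split rows of A
            h = m // 2
            return np.vstack([carma(A[:h], B, small), carma(A[h:], B, small)])
        if n >= k:                          # split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], small),
                              carma(A, B[:, h:], small)])
        h = k // 2                          # split shared dimension; add results
        return carma(A[:, :h], B[:h], small) + carma(A[:, h:], B[h:], small)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 300))      # "inner product"-shaped: k dominates
    B = rng.standard_normal((300, 40))
    assert np.allclose(carma(A, B), A @ B)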

CARMA Performance: Distributed Memory

[Figure: log-log plot, square case m = k = n = 6144; CARMA vs ScaLAPACK vs peak, on Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA]

[Figure: log-log plot, inner-product case m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK vs peak, same machine]

CARMA Performance: Shared Memory

[Figure: square case m = k = n; CARMA vs MKL, single and double precision, vs peak, on Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA]

[Figure: inner-product case m = n = 64; CARMA vs MKL, single and double precision, same machine]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: shared-memory inner product (m = n = 64, k = 524,288); CARMA incurs 86-97% fewer L3 misses than MKL]


One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n:
        update column i
        update trailing matrix
  words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b:
        update block i of b columns
        update trailing matrix
  words_moved = O(n^3/M^(1/3))
• Recursive approach:
    func factor(A):
        if A has 1 column: update it
        else:
            factor(left half of A)
            update right half of A
            factor(right half of A)
  words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting => n reductions
• Need another idea
(a runnable numpy version of the recursive approach follows)
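The recursive approach is concrete enough to run as written; here is a numpy version (a sketch without pivoting, so it assumes the leading minors are safe, e.g. a diagonally dominant A):

    import numpy as np
    from scipy.linalg import solve_triangular

    def recursive_lu(A):
        # In-place LU without pivoting on the tall panel A (m x n, m >= n):
        # L is unit lower triangular in the strict lower part, U in the upper.
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]             # one column: just scale
            return
        h = n // 2
        recursive_lu(A[:, :h])              # factor left half
        # Update right half: U12 = L11^{-1} A12, then the Schur complement.
        A[:h, h:] = solve_triangular(A[:h, :h], A[:h, h:],
                                     lower=True, unit_diagonal=True)
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]
        recursive_lu(A[h:, h:])             # factor right half

    n = 64
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant
    F = A.copy()
    recursive_lu(F)
    L = np.tril(F, -1) + np.eye(n)
    U = np.triu(F)
    assert np.allclose(L @ U, A)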

TSQR: An Architecture-Dependent Algorithm

Parallel (binary reduction tree):
    W = [W0; W1; W2; W3] -> local QRs give R00, R10, R20, R30
    -> pairwise QRs give R01, R11 -> final QR gives R02

Sequential/streaming (flat tree):
    W = [W0; W1; W2; W3] -> R00 -> fold in W1 to get R01
    -> fold in W2 to get R02 -> fold in W3 to get R03

Dual core (hybrid tree):
    local R00, R01 per core, with interleaved combines R11, R02, R11, R03

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
(a numpy emulation of the parallel tree follows)
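A serial emulation of the parallel (binary-tree) variant (a sketch; each list entry plays the role of one processor). The final R agrees with Householder QR up to row signs, which the R^T·R = W^T·W check avoids.

    import numpy as np

    def tsqr_r(W, p):
        # R factor of a tall-skinny W via a binary reduction tree over p blocks.
        # (The Q factor can be accumulated from the tree's local Q's; omitted.)
        Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, p)]  # local QRs
        while len(Rs) > 1:                     # combine pairs up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4000, 50))
    R = tsqr_r(W, p=4)
    # R is the triangular factor of W up to row signs: R^T R = W^T W.
    assert np.allclose(R.T @ R, W.T @ W)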

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "tournament pivoting"

    W (n x b) = [W1; W2; W3; W4], with Wi = Pi·Li·Ui
    Choose b pivot rows of each Wi, call them Wi'
    [W1'; W2'] = P12·L12·U12 -> choose b pivot rows, call them W12'
    [W3'; W4'] = P34·L34·U34 -> choose b pivot rows, call them W34'
    [W12'; W34'] = P1234·L1234·U1234 -> choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
(a serial emulation of the tournament follows)
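A serial numpy/scipy emulation of one tournament (a sketch using scipy's LU; the helper names are made up): each local GEPP nominates b pivot rows, winners are re-matched pairwise up the tree, and the surviving b rows are the ones moved to the top of W.

    import numpy as np
    from scipy.linalg import lu

    def top_pivot_rows(block_rows, W, b):
        # Indices (into W) of the first b pivot rows GEPP picks on W[block_rows].
        P, L, U = lu(W[block_rows])          # W[block_rows] = P @ L @ U
        order = P.argmax(axis=0)             # order[j] = local row of j-th pivot
        return block_rows[order[:b]]

    def tournament_pivoting(W, b, p):
        # Round 0: each of p "processors" nominates b rows from its block.
        cands = [top_pivot_rows(idx, W, b)
                 for idx in np.array_split(np.arange(W.shape[0]), p)]
        # Later rounds: stack two groups of b rows, run GEPP, keep b winners.
        while len(cands) > 1:
            cands = [top_pivot_rows(np.concatenate(cands[i:i + 2]), W, b)
                     for i in range(0, len(cands), 2)]
        return cands[0]

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 8))
    rows = tournament_pivoting(W, b=8, p=4)
    print(rows)   # b candidate pivot rows; move to top, LU without pivoting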


Minimizing Communication in TSLU

Parallel: W = [W1; W2; W3; W4], local LUs, then a binary tree of LUs on the stacked pivot rows
Sequential/streaming: flat tree, folding one block in at a time
Dual core: hybrid tree

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable

• Details matter
 – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
 – Only tournament pivoting is stable
• "Thm": New scheme as stable as partial pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
 – Both random matrices and "special ones"
 – Both binary tree (BCALU) and flat tree (FCALU)
 – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
 – See [D., Grigori, Xiang, 2010] for details
[Figure: backward error plots]

Why is stability of TSLU just a "Thm"?

• Proof is correct - in exact arithmetic
• Experiment
 – Generate 100 random 6x6, rank-3 matrices in Matlab
 – [L,U,P] = lu(A); then do LU without pivoting on P·A, and compare the L factors: are they the same?
   • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
 – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
 – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
 – Same experiment with 20x20 rank-4 matrices: || L - Lnp || often O(10^3)
• Much harder to break TSLU, but possible
 – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
 – Test conditioning of U; if not tiny (usual case), proceed, else
 – Compute || L ||; if not big (usual case), proceed, else
 – Factor A = QR using TSQR, then
 – Factor Q = PLU using TSLU, then
 – A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting
[Figure: 2D block-cyclic layout with tournament pivoting]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[Figure: 2.5D replicated layout]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heatmap of predicted speedup over log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x]

2.5D vs 2D LU, With and Without Pivoting
[Figure: performance comparison]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
 – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
   • Save 1/2 the flops, preserve inertia
 – Usual approach: Bunch-Kaufman
   • D block diagonal with 1x1 and 2x2 blocks
   • Pivot search down column, along row (lots of communication)
 – Alternative: Aasen
   • D = tridiagonal = T
   • Two steps: P·A·P^T = L·T·L^T where T is banded, using TSLU; then solve/factor the narrow band problem with T
   [Figure: band structure of T]
   • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
 – So far, could not do partial pivoting and minimize #messages, just #words
 – Challenge:
   • Column layout good for choosing pivots, bad for matmul
   • Blocked layout good for matmul, bad for choosing pivots
 – Solution: use both layouts, switching between them: "Shape Morphing LU" or SMLU

Recursive LU (columnwise layout only):
    func factor(A):
        if A has 1 column: update it
        else:
            factor(left half of A)
            update right half of A
            factor(right half of A)
  Words = O(n^3/M^(1/2)); Messages = O(n^3/M)

Shape Morphing LU:
    func factor(A):
        if A has 1 column: update it
        else:
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  Words = O(n^3/M^(1/2)); Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
 – Choose permutation P so that leading columns of A·P = Q·R span the column space of A: Rank-Revealing QR (RRQR)
 – Usual approach, like partial pivoting: put longest column first, update rest of matrix, repeat; hard to do using BLAS3 at all, let alone hit the lower bound
 – Use tournament pivoting
   • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
   • Thm: This approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
 – Idea extends to other pivoting schemes
   • Cholesky with diagonal pivoting
   • LU with complete pivoting
   • LDL^T with complete pivoting


What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
        for i = 1:n
            for j = 1:n
                D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
 – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21
(a runnable numpy version follows)
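Spelling out the semiring makes this runnable; below is a numpy sketch (not from the talk), using the slide's convention that D = A⊗B also takes the elementwise min with D's old value, checked against the triple-loop Floyd-Warshall.

    import numpy as np

    def minplus(A, B):
        # (A . B)[i, j] = min_k A[i, k] + B[k, j]   (min-plus "matmul")
        return (A[:, :, None] + B[None, :, :]).min(axis=1)

    def otimes(D, A, B):
        # Slide's convention: D = A (x) B  means  D = min(D, minplus(A, B)).
        return np.minimum(D, minplus(A, B))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12 = D[:h, :h], D[:h, h:]
        D21, D22 = D[h:, :h], D[h:, h:]
        D11 = dc_apsp(D11)
        D12 = otimes(D12, D11, D12)
        D21 = otimes(D21, D21, D11)
        D22 = otimes(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = otimes(D21, D22, D21)
        D12 = otimes(D12, D12, D22)
        D11 = otimes(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    # Random nonnegative weights, zero diagonal, n a power of 2.
    rng = np.random.default_rng(0)
    n = 32
    D0 = rng.uniform(1, 10, (n, n))
    np.fill_diagonal(D0, 0.0)

    # Reference: classical Floyd-Warshall.
    F = D0.copy()
    for k in range(n):
        F = np.minimum(F, F[:, k, None] + F[None, k, :])

    assert np.allclose(dc_apsp(D0.copy()), F)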

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores); 6.2x speedup overall, 2x speedup from replication]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
 – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
 – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
 – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; a new one?
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):
    Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
 – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost


Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
 – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
 – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
 – Q·U's columns are eigenvectors, Λ the eigenvalues
 – Dense -> Tridiagonal -> Diagonal
 – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-avoiding approach
 – A -> Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
 – Continue as above, starting with B
 – Dense -> Banded -> Tridiagonal -> Diagonal
 – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
 – Banded -> Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: a symmetric band matrix of bandwidth b is reduced by orthogonal transformations Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T. Each sweep eliminates c columns of the band, creating a bulge of d diagonals that is then chased down the matrix; the slides step through bulges 1–6. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

  Conventional:            touch all data 4 times
  Communication-Avoiding:  touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

      Q^T·A·Q = [ A11  A12 ]
                [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
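A hedged sketch of one splitting step follows. The actual algorithm relies on a randomized rank-revealing QR and avoids explicit inverses; this toy version substitutes a Newton iteration for the matrix sign function and SciPy's column-pivoted QR, purely to show how Q block-triangularizes A (names like split_spectrum are illustrative, and the Newton iteration assumes no eigenvalues near the imaginary axis):

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, iters=60):
    """Split A's spectrum at the imaginary axis; return Q and subspace dim k."""
    n = A.shape[0]
    S = A.copy()
    for _ in range(iters):                 # Newton iteration: S -> sign(A)
        S = 0.5 * (S + np.linalg.inv(S))
    P_plus = 0.5 * (np.eye(n) + S)         # projector onto invariant subspace
    Q, _, _ = qr(P_plus, pivoting=True)    # leading k columns span the subspace
    k = int(round(np.trace(P_plus)))       # subspace dimension
    return Q, k

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
Q, k = split_spectrum(A)
B = Q.T @ A @ Q
print(k, np.linalg.norm(B[k:, :k]))        # (2,1) block should be ~ roundoff
```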

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(The original table has four columns: #words and #messages, for two levels of memory and for a full memory hierarchy; the references below attain the bounds in those columns.)

• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: [G'97], [T'97], [GDX'11], [BDLST'13]
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13]
• Nonsym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; the bounds are #words = Ω(n²/P^(1/2)) and #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11] – latency (L) saving factor n/P^(1/2)
• Cholesky: [ScaLAPACK], [T'99], [SD'11] – L saving factor n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13], [ScaLAPACK] (words); [BBDDDPSTY'13] (messages) – L saving factor n/P^(1/2)
• LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11] (words); [GDX'11], [T'99], [SD'11] (messages) – L saving factor n/P^(1/2)
• QR: [ScaLAPACK], [DGHL'12], [T'99] (words); [DGHL'12], [T'99] (messages) – L saving factor n/P^(1/2)
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK] (words); [BDD'11], [BDK'13] (messages) – L saving factor n/P^(1/2)
• Nonsym. Eig: [BDD'11] – saving factors: BW P^(1/2), L n

Attaining with extra memory (2.5D): M = Ω(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the sketch below)
  – Parallel implementation on p processors
    • Conventional: O(k·log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                  75
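A toy serial illustration of where the O(1) data movement comes from, assuming a 1D Laplacian so each SpMV has dependency depth 1: to compute x, Ax, …, A^k x on its local block, a processor fetches a ghost region of depth k once, instead of exchanging a depth-1 halo k times:

```python
import numpy as np

def local_matrix_powers(x_ext, k):
    """x_ext: local block extended by k ghost entries on each side."""
    V = [x_ext.copy()]
    for _ in range(k):
        prev, nxt = V[-1], np.zeros_like(V[-1])
        # 1D Laplacian: (A v)_i = 2 v_i - v_{i-1} - v_{i+1}
        nxt[1:-1] = 2 * prev[1:-1] - prev[:-2] - prev[2:]
        V.append(nxt)                     # entries near the ends go stale...
    return [v[k:-k] for v in V]           # ...but the interior stays valid

rng = np.random.default_rng(6)
x_ext = rng.standard_normal(20)           # 12 local entries + k=4 ghosts per side
print(local_matrix_powers(x_ext, 4)[4])   # local piece of A^4 x, one "fetch"
```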

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                  77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                  78

Speedups on Itanium 2: The Need for Search

[Figure: Mflops of the reference implementation vs the best register blocking (4x2), across block sizes.]

                                                  79

Register Profile: Itanium 2

[Figure: performance of all register blockings, ranging from 190 Mflops to 1190 Mflops.]

                                                  80

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms – Power3 (17% of peak, 122–252 Mflops), Power4 (16%, 459–820 Mflops), Itanium 1 (8%, 107–247 Mflops), Itanium 2 (33%, 190 Mflops–1.2 Gflops).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                  82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                  83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                  84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual mflop rate 1.5² = 2.25x higher

                                                  85
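A small sketch of the fill-in trade-off using SciPy's BSR format as a stand-in for register blocking (the matrix here is synthetic, so the fill ratio will differ from the slide's 1.5): converting to r x c blocks stores explicit zeros but enables unrolled block multiplies:

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(3000, 3000, density=1e-3, format='csr', random_state=0)
A_bsr = A.tobsr(blocksize=(3, 3))     # pad each touched 3x3 cell with explicit zeros

fill_ratio = A_bsr.nnz / A.nnz        # stored values / true nonzeros (> 1)
print(f"fill ratio = {fill_ratio:.2f}")

x = np.ones(3000)
print(np.allclose(A @ x, A_bsr @ x))  # same SpMV result, different layout
```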

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                  86

                                                  100x100 Submatrix Along Diagonal


                                                  Post-RCM Reordering

                                                  88

                                                  Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue


2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                                  90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                  91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                  93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure. Annotation: the SpMVs and dot products require communication in each iteration.]

94

Example: CA-Conjugate Gradient

[Algorithm figure. Annotations: the s SpMVs are computed via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; local computations within the inner loop require no communication.]
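A minimal sketch of the communication pattern, assuming a 1D Poisson matrix and the monomial basis: the s-step basis comes from one matrix-powers sweep, and one Gram-matrix reduction replaces the O(s) separate dot products of classical CG:

```python
import numpy as np
import scipy.sparse as sp

n, s = 1000, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
r = np.random.default_rng(2).standard_normal(n)

V = np.empty((n, s + 1))
V[:, 0] = r
for j in range(s):                # matrix powers kernel (monomial basis)
    V[:, j + 1] = A @ V[:, j]

G = V.T @ V                       # one global reduction
# any inner product <A^i r, A^j r> needed in the inner loop is now a lookup:
print(np.isclose(G[1, 2], V[:, 1] @ V[:, 2]))
```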

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                  96

[Convergence plot: CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. A horizontal line marks machine precision.]

                                                  97
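The breakdown mechanism is easy to reproduce in miniature (a 1D Poisson matrix stands in for the 2D model problem): the monomial basis [r, Ar, …, A^s r] loses linear independence in floating point as s grows:

```python
import numpy as np
import scipy.sparse as sp

n = 900
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
r = np.random.default_rng(3).standard_normal(n)

for s in (4, 8, 16):
    V = np.empty((n, s + 1))
    V[:, 0] = r
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    print(s, f"cond(V) = {np.linalg.cond(V):.2e}")   # blows up with s
```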

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

  Nonzero entries \ Indices   Explicit (O(nnz))     Implicit (o(nnz))
  Explicit (O(nnz))           CSR and variations    Vision, climate, AMR, …
  Implicit (o(nnz))           Graph Laplacian       Stencils
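A short sketch of why the S + U·D·V^T representation matters operationally: A (and its powers) can be applied without ever forming the dense n x n matrix (sizes here are arbitrary):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(4)
n, k = 2000, 5
S = sp.random(n, n, density=1e-3, format='csr', random_state=4)
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))

def apply_A(x):
    # sparse part, plus the low-rank part as three skinny products
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
y = apply_A(apply_A(x))    # A^2 x, A never densified
print(y[:3])
```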

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                  101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (answers of the same magnitude but opposite signs) and relative error for orthogonal vectors. Setup: vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible.]

                                                  103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

                                                  104
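The root cause is easy to demonstrate: floating-point addition is not associative, so different summation orders (different thread counts, blockings) give different bits. A toy example, which typically prints False:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(10**6).astype(np.float32)

s_chunked = np.float32(0)
for chunk in np.split(x, 8):           # simulate an 8-thread blocking
    s_chunked += chunk.sum(dtype=np.float32)

s_numpy = x.sum(dtype=np.float32)      # numpy's own (pairwise) order
print(s_chunked == s_numpy, float(s_chunked) - float(s_numpy))
```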

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                  Summary

Don't Communic…

                                                  106

Time to redesign all linear algebra, n-body, … algorithms and software

                                                  (and compilers)


Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors and available memory
• CARMA (see the sketch below)
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors and available memory

CARMA Performance: Distributed Memory

[Plot (log-log): square case, m = k = n = 6144; CARMA vs ScaLAPACK vs peak, on Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.]

CARMA Performance: Distributed Memory

[Plot (log-log): inner-product-shaped case, m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK vs peak, on Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.]

CARMA Performance: Shared Memory

[Plot (log x-axis, linear y-axis): square case, m = k = n; CARMA vs MKL in single and double precision, with single/double peak lines, on Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA.]

CARMA Performance: Shared Memory

[Plot (log x-axis, linear y-axis): inner-product-shaped case, m = n = 64; CARMA vs MKL in single and double precision, on Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Bar chart (linear scale): shared-memory inner product, m = n = 64, k = 524,288; CARMA incurs 97% fewer L3 misses in one precision and 86% fewer in the other.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n³)
• Blocked Approach (LAPACK)
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words moved = O(n³/M^(1/3))
• Recursive Approach
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
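Here is the recursive approach made runnable for LU, a hedged sketch (no pivoting, so the test matrix is made diagonally dominant; pivoting is exactly the issue the following slides address):

```python
import numpy as np

def rlu(A):
    """Recursive in-place LU (no pivoting): factor left half, update, recurse."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                       # one column: scale below pivot
        return
    h = n // 2
    rlu(A[:, :h])                                 # factor(left half of A)
    L11 = np.tril(A[:h, :h], -1) + np.eye(h)
    A[:h, h:] = np.linalg.solve(L11, A[:h, h:])   # update right half: U12
    A[h:, h:] -= A[h:, :h] @ A[:h, h:]            # Schur complement
    rlu(A[h:, h:])                                # factor(right half of A)

n = 64
A = np.random.rand(n, n) + n * np.eye(n)          # diagonally dominant
A0 = A.copy()
rlu(A)
L, U = np.tril(A, -1) + np.eye(n), np.triu(A)
print(np.allclose(L @ U, A0))
```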

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3].
 – Parallel (binary tree): local QRs produce R00, R10, R20, R30; pairwise combines produce R01, R11; a final combine produces R02.
 – Sequential/streaming (flat tree): R00, then R01, R02, R03, folding in one block at a time.
 – Dual core: a hybrid of the two trees.]

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
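A minimal flat sketch of the TSQR invariant, with a single combine step (block count arbitrary): QR each row block locally, stack the small R factors, and QR once more; the final R agrees with the R of a direct QR of W up to row signs:

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]  # independent local QRs
    return np.linalg.qr(np.vstack(Rs), mode='r')        # one small combine

W = np.random.rand(10000, 50)                           # tall and skinny
R1 = tsqr_R(W)
R2 = np.linalg.qr(W, mode='r')
print(np.allclose(np.abs(R1), np.abs(R2)))              # equal up to signs
```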

Back to LU: Using a similar idea for TSLU as TSQR: use the reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
• Leaf round: factor each block, Wi = Pi·Li·Ui, and choose b pivot rows of Wi; call them Wi'
• Pairwise round: factor [W1'; W2'] = P12·L12·U12 and choose b pivot rows W12'; likewise [W3'; W4'] yields W34'
• Final round: factor [W12'; W34'] = P1234·L1234·U1234 and choose b pivot rows
• Go back to W and use these b pivot rows (move them to the top, do LU without pivoting)

                                                    37
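A hedged sketch of the tournament (function names illustrative; SciPy's GEPP plays each "game"): every block nominates b candidate pivot rows, winners merge pairwise and play again, and the surviving b rows pivot the whole panel:

```python
import numpy as np
from scipy.linalg import lu

def candidates(rows, W, b):
    """Run GEPP on W[rows]; return global indices of its b pivot rows."""
    P, _, _ = lu(W[rows])                    # W[rows] = P @ L @ U
    top = P[:, :b].argmax(axis=0)            # local indices of the b pivot rows
    return rows[top]

def tournament(W, b, nblocks=4):
    groups = [candidates(g, W, b)
              for g in np.array_split(np.arange(W.shape[0]), nblocks)]
    while len(groups) > 1:                   # pairwise reduction tree
        groups = [candidates(np.concatenate(groups[i:i+2]), W, b)
                  for i in range(0, len(groups), 2)]
    return groups[0]                         # b pivot rows for all of W

W = np.random.rand(1024, 8)
print(tournament(W, b=8))
```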

Minimizing Communication in TSLU

[Figure: the same reduction-tree choices as for TSQR, with an LU at each tree node – parallel (binary tree), sequential/streaming (flat tree), and dual-core (hybrid).]

Can choose reduction tree dynamically, to match architecture, as before

                                                    38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": The new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it produces the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

                                                    39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L − Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

                                                    42
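A control-flow sketch of this fix, with hypothetical tslu and tsqr callables standing in for the real kernels; the tolerances are illustrative, not from the slide:

    import numpy as np

    def fixed_tslu(A, tslu, tsqr, cond_tol=1e10, growth_tol=1e3):
        """Run TSLU; fall back to the QR-based factorization in the (rare) unstable case.

        Assumed (hypothetical) kernel interfaces:
          tslu(A) -> (P, L, U) with P @ A = L @ U
          tsqr(A) -> (Q, R)    with A = Q @ R
        """
        P, L, U = tslu(A)
        # Cheap stability tests: U not too ill-conditioned, L not too large.
        if np.linalg.cond(U) < cond_tol and np.linalg.norm(L, np.inf) < growth_tol:
            return P, L, U                    # usual case
        Q, R = tsqr(A)                        # fallback: A = Q @ R
        P, L, U = tslu(Q)                     # P @ Q = L @ U
        return P, L, U @ R                    # so P @ A = L @ (U @ R), with U @ R upper triangular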

                                                    2D CALU with Tournament Pivoting

                                                    43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                                    44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
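A quick sanity check of what these parameters imply, as arithmetic (no new data, just the numbers above):

    nodes = 2**20
    cores = nodes * 1024                   # ~1.07e9 cores: "a billion"
    mem_per_node = 32 * 2**50 / nodes      # 32 PB total -> 32 GiB per node

    # Message size needed so bandwidth cost matches latency cost:
    bw, lat = 100e9, 1e-6                  # 100 GB/s interconnect, 1 microsecond latency
    print(bw * lat)                        # = 100 KB: much smaller messages are latency-bound

    print(1e-6 / 50e-9)                    # interconnect latency is 20x memory latency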

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of modeled speedup over the plane log2(p) vs log2(n²/p) = log2(memory_per_proc). Up to 29x predicted speedup.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry, PAPᵀ = LDLᵀ, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAPᵀ = LTLᵀ, where T is banded, using TSLU
        [Figure: the banded structure of T]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper Award at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• Plain recursive LU:

    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)

  – Words = O(n³/M^(1/2))
  – Messages = O(n³/M)

• Shape Morphing LU:

    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

  – Words = O(n³/M^(1/2))
  – Messages = O(n³/M^(3/2))
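To see what the extra factor of M^(1/2) in the message count buys, a back-of-envelope comparison with illustrative (not slide-given) values of n and M:

    n = 1 << 14                  # 16384-by-16384 matrix
    M = 1 << 21                  # fast memory holds 2^21 words (~16 MB of doubles)

    messages_recursive = n**3 / M           # O(n^3 / M):       ~2.1e6 messages
    messages_smlu      = n**3 / M**1.5      # O(n^3 / M^(3/2)): ~1.4e3 messages
    print(messages_recursive / messages_smlu)   # ratio = M^(1/2) ~ 1448x fewer messages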

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes (a sketch of one tournament round follows)
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50
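A minimal sketch of a column tournament in Python/SciPy, using ordinary QR with column pivoting as the per-round selector (the slide allows this "usual approach" or the stronger Gu/Eisenstat selection); tournament_columns and best_b are hypothetical names:

    import numpy as np
    from scipy.linalg import qr

    def best_b(A, cols, b):
        """Select b column indices from the candidate list `cols` by QR with column pivoting."""
        _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
        return [cols[j] for j in piv[:b]]

    def tournament_columns(A, b):
        """Pairwise play-offs (a binary reduction tree) until b columns remain."""
        groups = [list(range(j, min(j + b, A.shape[1]))) for j in range(0, A.shape[1], b)]
        while len(groups) > 1:
            merged = [best_b(A, g1 + g2, b) for g1, g2 in zip(groups[0::2], groups[1::2])]
            if len(groups) % 2:
                merged.append(groups[-1])    # odd group gets a bye
            groups = merged
        return groups[0]

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 32))   # rank 4
    print(tournament_columns(A, 4))   # 4 columns that (approximately) span col(A)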

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

52
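A minimal NumPy sketch of this recursion, assuming n is a power of two; the slide's "*" becomes minplus below, which also takes the elementwise min with the old block, per the abbreviation above:

    import numpy as np

    def minplus(C, A, B):
        """C = min(C, A (x) B) in the (min,+) semiring."""
        return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11 = dc_apsp(D[:h, :h])
        D12, D21, D22 = D[:h, h:], D[h:, :h], D[h:, h:]
        D12 = minplus(D12, D11, D12)
        D21 = minplus(D21, D21, D11)
        D22 = minplus(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = minplus(D21, D22, D21)
        D12 = minplus(D12, D12, D22)
        D11 = minplus(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    # Usage: D[i,j] = edge weight, np.inf if no edge, 0 on the diagonal.
    D = np.array([[0, 3, np.inf, np.inf],
                  [np.inf, 0, 1, np.inf],
                  [np.inf, np.inf, 0, 2],
                  [1, np.inf, np.inf, 0]], dtype=float)
    print(dc_apsp(D))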

Performance of 2.5D APSP using Kleene

53

[Plot: Strong Scaling on Hopper (Cray XE6, with 1024 nodes = 24,576 cores). Callouts: 6.2x speedup and 2x speedup.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A=Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B=Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: the bulge-chasing animation, over several sweeps. Notation: b = bandwidth, c = #columns, d = #diagonals, with the constraint c+d ≤ b. Each step applies an orthogonal transform Q1, Q1ᵀ, Q2, Q2ᵀ, …, Q5, Q5ᵀ to annihilate c columns inside the band (of width b+1), then chases the resulting (d+c)x(d+c) bulges down the band, d+1 diagonals at a time, through numbered stages 1–6.]

Conventional vs CA - SBR

[Animations comparing the two bulge-chasing schedules.]
• Conventional: touches all data 4 times
• Communication-Avoiding: touches all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 1.7x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 1.2x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 2.5x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 3.0x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

      QᵀAQ = [ A11  A12 ]
             [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

(Columns: Words | Messages, first for a two-level memory, then for a full memory hierarchy.)

BLAS-3:            [FLPR'99][BDLST'13][MKL etc.]  |  [FLPR'99][BDLST'13][MKL etc.]
Cholesky:          [G'97][AP'00] | [LAPACK][BDHS'09]  ||  [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite:   [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13]  ||  [G'97][T'97][BDLST'13] | [BDLST'13]
QR:                [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13]  ||  [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD:    [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig:      [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

(Columns: Words (BW) | Messages (L) || Saving factor.)

BLAS-3:            [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]  ||  L: n/P^(1/2)
Cholesky:          [ScaLAPACK][T'99][SD'11]  ||  L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13]  ||  L: n/P^(1/2)
LU:                [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11]  ||  L: n/P^(1/2)
QR:                [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99]  ||  L: n/P^(1/2)
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD:    [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13]  ||  L: n/P^(1/2)
Non-Sym. Eig:      [BDD'11] | [BDD'11]  ||  BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profile; the Reference code and the Best block size (4x2) are marked, in Mflops.]

79

Register Profile: Itanium 2

[Figure: heat map of SpMV Mflop rate over register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile heat maps — Power3 (best 252 Mflops, reference 122 Mflops), Power4 (820 / 459 Mflops), Itanium 1 (247 / 107 Mflops), Itanium 2 (1.2 Gflops / 190 Mflops); panel labels: Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5² = 2.25x higher

85
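The fill ratio is easy to measure with SciPy, which stores explicit zeros when converting CSR to blocked (BSR) format — a sketch on a random matrix rather than the slide's FEM matrix:

    import scipy.sparse as sp

    A = sp.random(999, 999, density=0.01, format='csr', random_state=0)
    B = A.tobsr(blocksize=(3, 3))     # pads each touched 3x3 cell with explicit zeros

    fill_ratio = B.nnz / A.nnz        # stored entries (incl. explicit zeros) / true nonzeros
    print(fill_ratio)                 # blocking pays off only if the speed gain beats this factor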

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
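OSKI's run-time tuning can be mimicked in miniature: time SpMV for several register block sizes and keep the winner. A hedged SciPy sketch of the search idea (not the OSKI API):

    import time
    import numpy as np
    import scipy.sparse as sp

    n = 1152                                     # divisible by all block sizes tried
    A = sp.random(n, n, density=0.01, format='csr', random_state=0)
    x = np.ones(n)

    def spmv_time(M, trials=50):
        t0 = time.perf_counter()
        for _ in range(trials):
            M @ x
        return time.perf_counter() - t0

    candidates = {(1, 1): A}
    for r, c in [(2, 2), (3, 3), (4, 2), (4, 4), (8, 8)]:
        candidates[(r, c)] = A.tobsr(blocksize=(r, c))

    best = min(candidates, key=lambda rc: spmv_time(candidates[rc]))
    print("best block size:", best)              # machine- and matrix-dependent: hence search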

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure. Callout: the SpMVs and dot products require communication in each iteration.]

94

Example: CA-Conjugate Gradient

[Algorithm figure. Callouts: the SpMVs are computed via the CA matrix powers kernel; the dot products become one global reduction to compute the Gram matrix G; local computations within the inner loop require no communication.]
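The flavor of the reorganization, in a hedged NumPy sketch: build an s-step Krylov basis (the matrix powers kernel's output), then obtain every inner product CA-CG needs from one Gram matrix, i.e. one global reduction. The diagonal test matrix is a stand-in for a sparse SPD matrix:

    import numpy as np

    def monomial_basis(A, v, s):
        """V = [v, Av, A^2 v, ..., A^s v]  (what the CA matrix powers kernel computes)."""
        V = [v]
        for _ in range(s):
            V.append(A @ V[-1])
        return np.column_stack(V)

    rng = np.random.default_rng(0)
    A = np.diag(rng.uniform(1, 2, 100))     # stand-in for a sparse SPD matrix
    p = rng.standard_normal(100)

    V = monomial_basis(A, p, s=4)
    G = V.T @ V                             # ONE reduction, instead of a dot product per step

    # Any dot product of vectors expressed in the basis V is now local arithmetic:
    a, b = rng.standard_normal(5), rng.standard_normal(5)
    print(np.allclose((V @ a) @ (V @ b), a @ G @ b))   # True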

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. Callouts: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; machine-precision reference line.]

97
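The breakdown is reproducible in a few lines. A sketch building the slide's model problem (2D Poisson, 5-point stencil, 30x30 grid) and checking the numerical rank of the monomial basis at s = 16; the all-ones starting vector is an arbitrary choice here:

    import numpy as np
    import scipy.sparse as sp

    n = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))   # 2D Poisson, 5-point stencil
    print(np.linalg.cond(A.toarray()))                  # ~ 400, as on the slide

    v = np.ones(n * n)
    V = [v]
    for _ in range(16):
        V.append(A @ V[-1])                             # monomial basis: no scaling at all
    V = np.column_stack(V)
    print(V.shape[1], np.linalg.matrix_rank(V))         # 17 columns, but numerical rank < 17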

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

(Examples, by how nonzero entries and indices are stored:)

                                Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):    CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):    Graph Laplacian             Stencils
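For example, the action of A = S + UDVᵀ on a vector never needs the dense A; a small sketch with arbitrary dimensions:

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n, r = 2000, 5
    S = sp.random(n, n, density=1e-3, format='csr', random_state=0)   # sparse part
    U = rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))                               # r x r, "small & square"
    V = rng.standard_normal((n, r))

    def apply_A(x):
        """y = (S + U D V^T) x in O(nnz(S) + n*r) work, no dense n x n matrix formed."""
        return S @ x + U @ (D @ (V.T @ x))

    x = rng.standard_normal(n)
    print(apply_A(x)[:3])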

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

101

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads. Absolute error = maximum – minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

104
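The phenomenon, and the "exact answer" approach, both fit in a few lines of Python: math.fsum returns the correctly rounded exact sum, so it is reproducible under any summand order — though, matching the slide's caveat, this route gives up goal 2 (performance):

    import math
    import random

    random.seed(42)
    xs = [random.uniform(-1.0, 1.0) for _ in range(10**5)]

    s1 = sum(xs)                 # one summation order
    random.shuffle(xs)           # a different "reduction tree"
    s2 = sum(xs)

    print(s1 - s2)               # typically nonzero: rounding depends on order
    print(math.fsum(xs) == math.fsum(sorted(xs)))   # True: the exact sum is order-independent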

Performance results on 1024 procs of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n=1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


CARMA Performance: Distributed Memory
[Plot, log-log axes: square matmul, m = k = n = 6144; CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper); each node 2 x 12-core, 4 x NUMA.]

CARMA Performance: Distributed Memory
[Plot, log-log axes: inner-product shape, m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK vs peak. Cray XE6 (Hopper); each node 2 x 12-core, 4 x NUMA.]

CARMA Performance: Shared Memory
[Plot, log x-axis / linear y-axis: square matmul, m = k = n; MKL vs CARMA in single and double precision, with single- and double-precision peak. Intel Emerald: 4 x Intel Xeon X7560 (8 cores each), 4 x NUMA.]

CARMA Performance: Shared Memory
[Plot, log x-axis / linear y-axis: inner-product shape, m = n = 64; MKL vs CARMA in single and double precision. Intel Emerald: 4 x Intel Xeon X7560 (8 cores each), 4 x NUMA.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot, linear scale: L3 misses for the shared-memory inner product (m = n = 64, k = 524,288); annotations: 97% fewer misses, 86% fewer misses.]


One-sided Factorizations (LU, QR), so far
• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  – #words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages.
• Parallel case: partial pivoting => n reductions.
• Need another idea.
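A minimal numpy sketch of the recursive approach above (our illustration, not code from the talk; there is no pivoting, so it is only safe for matrices like the diagonally dominant one in the check):

    import numpy as np

    def rlu(A):
        # In-place recursive LU without pivoting: on return, the strict lower
        # triangle of A holds L (unit diagonal implied), the upper triangle U.
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]
            return
        k = n // 2
        rlu(A[:, :k])                                  # factor left half
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])    # U12 = L11^{-1} A12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]             # Schur complement update
        rlu(A[k:, k:])                                 # factor right half

    A0 = np.random.rand(8, 8) + 8 * np.eye(8)          # safe without pivoting
    A = A0.copy()
    rlu(A)
    L, U = np.tril(A, -1) + np.eye(8), np.triu(A)
    assert np.allclose(L @ U, A0)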

TSQR: An Architecture-Dependent Algorithm

[Diagrams: W = [W0; W1; W2; W3] reduced by a tree of local QR factorizations.
Parallel (binary tree): each Wi -> Ri0; then (R00, R10) -> R01 and (R20, R30) -> R11; finally (R01, R11) -> R02.
Sequential/streaming (flat tree): W0 -> R00; [R00; W1] -> R01; [R01; W2] -> R02; [R02; W3] -> R03.
Dual core: a hybrid of the two trees.]

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core.
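A compact numpy sketch of TSQR's reduction tree (our illustration; a real implementation would retain the implicit Q factor at each tree node instead of discarding it):

    import numpy as np

    def tsqr(W_blocks):
        # Binary-tree TSQR: QR each local block, then repeatedly stack
        # pairs of R factors and re-factor, up the tree.
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in W_blocks]
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(4000, 8)
    R_tree = tsqr(np.array_split(W, 4))
    R_ref = np.linalg.qr(W, mode='r')
    # R is unique up to the signs of its rows
    assert np.allclose(np.abs(R_tree), np.abs(R_ref))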

Back to LU: Using a similar idea for TSLU as for TSQR: use a reduction tree to do "tournament pivoting"

W (n x b) = [W1; W2; W3; W4]
• Round 1: factor each block, Wi = Pi·Li·Ui; choose b pivot rows of each Wi, call them Wi'.
• Round 2: [W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'; [W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'.
• Round 3: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.
• Go back to W and use these b pivot rows (move them to the top, do LU without pivoting).
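A rough Python sketch of one tournament (ours, not the CALU code; best_b and tournament_pivots are hypothetical names). Each "game" runs GEPP on the stacked candidate rows and keeps its first b pivot rows:

    import numpy as np
    from scipy.linalg import lu

    def best_b(W, idx, b):
        # GEPP on the candidate rows; keep its first b pivot rows.
        p, l, u = lu(W[idx])                 # scipy convention: W[idx] = p @ l @ u
        order = np.argmax(p, axis=0)         # order[i] = i-th pivot row of W[idx]
        return idx[order[:b]]

    def tournament_pivots(W, b, nblocks):
        # Reduction tree of "playoffs": b winners per leaf block, then
        # pairwise merges up the tree; returns b global pivot-row indices.
        cands = [best_b(W, idx, b)
                 for idx in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(cands) > 1:
            cands = [best_b(W, np.concatenate(cands[i:i + 2]), b)
                     for i in range(0, len(cands), 2)]
        return cands[0]

    W = np.random.rand(1024, 4)
    rows = tournament_pivots(W, b=4, nblocks=8)
    # then: move the chosen rows to the top and do LU without pivoting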

Minimizing Communication in TSLU

[Diagrams, analogous to TSQR: W = [W1; W2; W3; W4] reduced by trees of local LU factorizations – parallel (binary tree), sequential/streaming (flat tree), and dual core (hybrid).]

Can choose the reduction tree dynamically to match the architecture, as before.

Making TSLU Numerically Stable
• Details matter:
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U.
  – Only tournament pivoting is stable.
• "Thm": the new scheme is as stable as partial pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A.
• Why just a "Thm"?

Stability of LU using TSLU: CALU
• Empirical testing:
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA - LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?
• The proof is correct – in exact arithmetic.
• Experiment:
  – Generate 100 random 6x6 rank-3 matrices in Matlab.
  – [L,U,P] = lu(A); then do LU without pivoting on P·A and compare the L factors: are they the same?
  – Compute ||L - Lnp||: a few 0's, a few Inf's, a few NaNs; the rest mostly O(1).
  – Why? Floating point is nonassociative; doing the arithmetic in a different order gives different rounding errors.
  – Same experiment with rank-6 matrices: ||L - Lnp|| usually nonzero, O(macheps).
  – Same experiment with 20x20 rank-4 matrices: ||L - Lnp|| often O(10^3).
• Much harder to break TSLU, but possible:
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization.
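A rough numpy/scipy reconstruction of the Matlab experiment (ours; lu_nopivot_L is a hypothetical helper):

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot_L(A):
        # L factor of LU without pivoting; may produce Inf/NaN on
        # rank-deficient inputs, which is the point of the experiment.
        A = A.astype(float).copy()
        n = A.shape[0]
        with np.errstate(divide='ignore', invalid='ignore'):
            for k in range(n - 1):
                A[k+1:, k] /= A[k, k]
                A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return np.tril(A, -1) + np.eye(n)

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        p, L, U = lu(A)                      # GEPP: A = p @ L @ U
        Lnp = lu_nopivot_L(p.T @ A)          # no-pivot LU on the permuted matrix
        diffs.append(np.linalg.norm(L - Lnp))
    # expect a few 0's, Inf's and NaNs, and mostly O(1) values in diffs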

Fixing TSLU
• Run TSLU quickly; test for stability; fix if necessary (rare):
  – Test conditioning of U: if not tiny (usual case), proceed; else
  – Compute ||L||: if not big (usual case), proceed; else
  – Factor A = QR using TSQR, then
  – Factor Q = P·L·U using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor.
• Last topic in lecture: how to guarantee floating-point reproducibility.

2D CALU with Tournament Pivoting
[Figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[Figure]

Exascale Machine Parameters (source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
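A back-of-envelope sketch (ours) plugging the interconnect numbers above into the communication terms of the running-time model from the start of the talk (the flop term is omitted):

    def comm_seconds(words, messages, bytes_per_word=8,
                     bandwidth=100e9, latency=1e-6):
        # time = #words / bandwidth + #messages * latency
        return words * bytes_per_word / bandwidth + messages * latency

    # e.g. one 1 MiB message: comm_seconds(131072, 1) ~ 1.1e-5 seconds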

Exascale Predicted Speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Heat map: x-axis log2(p); y-axis log2(n^2/p) = log2(memory_per_proc); speedups up to 29x.]

2.5D vs 2D LU, With and Without Pivoting
[Figure]

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite:
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"; saves half the flops, preserves inertia.
  – Usual approach: Bunch-Kaufman: D block diagonal with 1x1 and 2x2 blocks; pivot search down the column and along the row (lots of communication).
  – Alternative: Aasen: D = tridiagonal = T. Two steps:
    1. P·A·P^T = L·T·L^T, where T is banded, using TSLU.
       [Diagram: banded matrix T]
    2. Solve/factor the narrow band problem with T.
  – Up to 2.8x faster than MKL; Best Paper at IPDPS'13.
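For the "usual approach", SciPy exposes the LAPACK Bunch-Kaufman-style factorization directly; a small demo (ours) showing the 1x1/2x2 block-diagonal D that Aasen's banded-T variant replaces with a tridiagonal:

    import numpy as np
    from scipy.linalg import ldl

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 6)); A = A + A.T     # symmetric indefinite
    L, D, perm = ldl(A)          # LAPACK sytrf: Bunch-Kaufman-style pivoting
    assert np.allclose(L @ D @ L.T, A)
    # D is block diagonal with 1x1 and 2x2 blocks; L[perm] is triangular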

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP:
  – So far we could not do partial pivoting and minimize #messages, just #words.
  – Challenge:
    • Column layout: good for choosing pivots, bad for matmul.
    • Blocked layout: good for matmul, bad for choosing pivots.
  – Solution: use both layouts, switching between them: "Shape Morphing LU", or SMLU.

  Recursive GEPP (column layout throughout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words = O(n^3 / M^(1/2)); #messages = O(n^3 / M)

  Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #words = O(n^3 / M^(1/2)); #messages = O(n^3 / M^(3/2))
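A toy sketch (ours; morph_to_tiles is a hypothetical name) of the reshape step SMLU performs between the pivoting and matmul phases; it moves O(n^2) words per level:

    import numpy as np

    def morph_to_tiles(A, b):
        # Columnwise layout -> contiguous b x b tiles, one array per tile.
        m, n = A.shape
        return {(i // b, j // b): np.ascontiguousarray(A[i:i + b, j:j + b])
                for i in range(0, m, b) for j in range(0, n, b)}

    tiles = morph_to_tiles(np.arange(64.0).reshape(8, 8), 4)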

Other CA algorithms for Ax=b, least squares (3/3)
• The need for pivoting arises beyond LU, e.g. in QR:
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR).
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat. Hard to do using BLAS-3 at all, let alone hit the lower bound.
  – Use tournament pivoting: each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat).
  – Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix.
  – The idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
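A small scipy sketch (ours) of column tournament pivoting for RRQR, using pivoted QR as the per-game selector (the "usual approach"; Gu/Eisenstat would be the stronger choice):

    import numpy as np
    from scipy.linalg import qr

    def best_b_cols(A, cols, b):
        # pick the b "most independent" columns of A[:, cols] via pivoted QR
        _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
        return cols[piv[:b]]

    def tournament_cols(A, b, nblocks):
        groups = [best_b_cols(A, g, b)
                  for g in np.array_split(np.arange(A.shape[1]), nblocks)]
        while len(groups) > 1:
            groups = [best_b_cols(A, np.concatenate(groups[i:i + 2]), b)
                      for i in range(0, len(groups), 2)]
        return groups[0]

    A = np.random.rand(500, 64)
    lead = tournament_cols(A, b=8, nblocks=8)   # candidate leading columns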


What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm.
• Ex: All-Pairs Shortest Paths using Floyd-Warshall. Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea.
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B: the dependencies are OK, so 2.5D works – it's just a different semiring.
• Kleene's Algorithm (divide-and-conquer APSP):

    D = DC-APSP(A, n):
      D = A
      partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
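A runnable numpy version of Kleene's algorithm (ours), with the slide's D = A*B semiring product spelled out, checked against classical Floyd-Warshall:

    import numpy as np

    def star(D, A, B):
        # the slide's D = A*B: D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12 = D[:h, :h], D[:h, h:]
        D21, D22 = D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = star(D12, D11, D12)
        D21[:] = star(D21, D21, D11)
        D22[:] = star(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = star(D21, D22, D21)
        D12[:] = star(D12, D12, D22)
        D11[:] = star(D11, D12, D21)
        return D

    rng = np.random.default_rng(1)
    D = rng.uniform(1, 10, (8, 8)); np.fill_diagonal(D, 0)
    ref = D.copy()
    for k in range(8):                       # classical Floyd-Warshall
        ref = np.minimum(ref, ref[:, [k]] + ref[[k], :])
    assert np.allclose(dc_apsp(D.copy()), ref)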

Performance of 2.5D APSP using Kleene
[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup.]

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those.
• Ex: Cholesky on a matrix A with good separators.
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w – so we need to do dense Cholesky on a w x w submatrix.
• Thm: #words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for a 2D grid, a 3D grid, and similar matrices: w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid.
• Sequential multifrontal Cholesky attains the bounds.
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package: attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators).

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one.
• Ex: A, B both diagonal: no communication in the parallel case.
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on).
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):
    #words_moved = Ω(min(d·n / P^(1/2), d^2·n / P))
  – The proof exploits the fact that reuse of entries of C = A·B is unlikely.
• Contrast the general lower bound: #words_moved = Ω(d^2·n / (P·M^(1/2))).
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost.
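A quick scipy demo (ours) of why reuse is unlikely for Erdos-Renyi matmul: nnz(C) grows like d^2·n, so almost every output entry is formed from a single scalar product:

    import scipy.sparse as sp

    n, d = 4000, 4
    A = sp.random(n, n, density=d / n, format='csr')
    B = sp.random(n, n, density=d / n, format='csr')
    C = (A @ B).tocsr()
    print(A.nnz, C.nnz)    # roughly d*n and d^2*n: little reuse of C's entries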


Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS-3, half BLAS-2 in LAPACK's sytrd
• Communication-avoiding approach:
  – A -> Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: needs a new(ish) idea
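For contrast, the conventional Dense -> Tridiagonal -> Diagonal pipeline is easy to demo in SciPy (our sketch; for symmetric input, hessenberg returns a numerically tridiagonal T):

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal

    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 300)); A = A + A.T
    T = hessenberg(A)                          # symmetric input => tridiagonal T
    w = eigh_tridiagonal(np.diag(T), np.diag(T, -1), eigvals_only=True)
    assert np.allclose(np.sort(w), np.linalg.eigvalsh(A))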

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of diagrams: bulge chasing on a symmetric band matrix. Labels: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d <= b. Sweep Q1 annihilates c columns below the (d+1)-st diagonal, creating a bulge of size d+c; applying Q1^T on the other side pushes the bulge down the band; sweeps Q2, Q3, Q4, Q5, ... chase the numbered bulges (1 through 6) off the end of the matrix.]

Conventional vs CA-SBR
[Animations: the conventional scheme touches all data 4 times; the communication-avoiding scheme touches all data once.]

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0 (n = 12000, b = 500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n = 12000, b = 200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n = 9000, b = 500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n = 12000, b = 500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD:
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm.
• Instead: spectral divide-and-conquer:
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A.
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22.
  – Depends on randomization:
    1. Randomized rank-revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

                    | Two Levels (#Words / #Messages)                              | Memory Hierarchy (#Words / #Messages)
BLAS-3              | [FLPR'99] [BDLST'13] [MKL etc.]                              | [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky            | [G'97] [AP'00] [LAPACK] [BDHS'09] / [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09]
Sym. Indefinite     | [BBDDDPSTY'13]                                               | [BBDDDPSTY'13]
LU                  | [G'97] [T'97] [GDX'11] [BDLST'13] / [GDX'11] [BDLST'13]      | [G'97] [T'97] [BDLST'13] / [BDLST'13]
QR                  | [EG'98] [FW'03] [DGHL'12] [BDLST'13] / [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] / [FW'03] [BDLST'13]
Rank-Revealing QR   | [BDD'11] [DGGX'13]                                           |
Sym. Eig & SVD      | [BDD'11] [BDK'13]                                            | [BDD'11]
Non-Sym. Eig        | [BDD'11]                                                     | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2 / P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

                    | #Words (BW)                                          | #Messages (L)             | Saving factor
BLAS-3              | [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11] |                           | L: n/P^(1/2)
Cholesky            | [ScaLAPACK] [T'99] [SD'11]                           |                           | L: n/P^(1/2)
Sym. Indefinite     | [BBDDDPSTY'13] [ScaLAPACK]                           | [BBDDDPSTY'13]            | L: n/P^(1/2)
LU                  | [ScaLAPACK] [GDX'11] [T'99] [SD'11]                  | [GDX'11] [T'99] [SD'11]   | L: n/P^(1/2)
QR                  | [ScaLAPACK] [DGHL'12] [T'99]                         | [DGHL'12] [T'99]          | L: n/P^(1/2)
Rank-Revealing QR   | [BDD'11] [DGGX'13]                                   |                           |
Sym. Eig & SVD      | [BDD'11] [BDK'13] [ScaLAPACK]                        | [BDD'11] [BDK'13]         | L: n/P^(1/2)
Non-Sym. Eig        | [BDD'11]                                             | [BDD'11]                  | BW: P^(1/2); L: n

Attaining with extra memory: 2.5D, M = Θ(c·n^2/P).


Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx:
  – Does k SpMVs with A and a starting vector.
  – Many such "Krylov subspace methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication.
  – Assume the matrix is "well-partitioned".
  – Serial implementation: conventional, O(k) moves of data from slow to fast memory; new, O(1) moves of data – optimal.
  – Parallel implementation on p processors: conventional, O(k log p) messages (k SpMV calls, dot products); new, O(log p) messages – optimal.
• Lots of speedup possible (modeled and measured).
  – Price: some redundant computation.
  – Challenges: poor partitioning, preconditioning, numerical stability.
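The serial and parallel savings come from the "matrix powers kernel". A 1D toy version (ours): after a single ghost exchange of k boundary values, k applications of a 3-point stencil need no further communication:

    import numpy as np

    def stencil_powers(v, k, a=(-1.0, 2.0, -1.0)):
        # v: a processor's m local values padded with k ghost cells on
        # each side; each pass shrinks the valid region by one cell/side.
        for _ in range(k):
            v = a[0] * v[:-2] + a[1] * v[1:-1] + a[2] * v[2:]
        return v            # length m: the local piece of A^k x

    m, k = 10, 3
    y = stencil_powers(np.random.rand(m + 2 * k), k)
    assert y.shape == (m,)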


Example: The Difficulty of Tuning SpMV
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
[Spy plot of the matrix]

Example: The Difficulty of Tuning
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
[Zoomed spy plot showing the 8x8 blocks]

Speedups on Itanium 2: The Need for Search
[Register-blocking profile: reference implementation (Mflops) vs the best blocking, 4x2 (Mflops).]

Register Profile: Itanium 2
[Heat map over register block sizes: from 190 Mflops (worst) to 1190 Mflops (best).]

Register Profiles: IBM and Intel IA-64
[Four register-blocking heat maps: Power3 (17), 122-252 Mflops; Power4 (16), 459-820 Mflops; Itanium 1 (8), 107-247 Mflops; Itanium 2 (33), 190 Mflops-1.2 Gflops.]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16,614; NNZ = 1.1 M
[Spy plot of the matrix]

Zoom in to top corner
• More complicated nonzero structure in general
• N = 16,614; NNZ = 1.1 M
[Zoomed spy plot]

3x3 blocks look natural, but...
• Example: 3x3 blocking – logical grid of 3x3 cells
• But it would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup! (Actual mflop rate is 1.5^2 = 2.25x higher.)
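A small scipy helper (ours) that computes the fill ratio of r x c register blocking, i.e. stored entries (including explicit zeros) over true nonzeros:

    import scipy.sparse as sp

    def fill_ratio(A, r, c):
        A = A.tocsr()
        rows, cols = A.nonzero()
        nblocks = len(set(zip(rows // r, cols // c)))
        return nblocks * r * c / A.nnz

    # toy example; on a random scatter the ratio is close to the worst case 9,
    # whereas the cavity matrix above achieves 1.5 with 3x3 blocks
    A = sp.random(300, 300, density=0.02, format='csr')
    print(fill_ratio(A, 3, 3))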

[Spy plots: accelerator cavity design problem (source: Ko, via Husbands); a 100x100 submatrix along the diagonal; the same region after RCM reordering; effect of combined RCM+TSP reordering (before: green + red; after: green + blue).]
• 2x speedups on Pentium 4, Power 4, ...
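The RCM reordering shown above is available in SciPy; a sketch (ours, on a hypothetical random symmetric pattern) printing the matrix bandwidth before and after:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def bandwidth(M):
        r, c = M.nonzero()
        return int(np.abs(r - c).max())

    A = sp.random(2000, 2000, density=0.002, format='csr')
    A = ((A + A.T) > 0).astype(float).tocsr()        # symmetric pattern
    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    A_rcm = A[perm][:, perm]                         # nnz clustered near diagonal
    print(bandwidth(A), '->', bandwidth(A_rcm))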

Summary of Other Performance Optimizations
• Optimizations for SpMV:
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve:
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels:
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later...

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine:
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers:
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI:
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)
[Algorithm: the SpMVs and dot products require communication in each iteration.]

Example: CA-Conjugate Gradient
[Algorithm: the SpMVs are done via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; the local computations within the inner loop require no communication.]
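A minimal numpy CG (ours) with the communication points of the classical algorithm marked; CA-CG would replace s of these iterations by one matrix-powers call plus one block reduction that forms the Gram matrix G:

    import numpy as np

    def cg(A, b, tol=1e-8, maxit=1000):
        # each iteration: 1 SpMV (neighbor communication in parallel)
        # and 2 inner products (global reductions)
        x = np.zeros_like(b)
        r = b.copy(); p = r.copy(); rr = r @ r
        for _ in range(maxit):
            Ap = A @ p                       # SpMV
            alpha = rr / (p @ Ap)            # global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                   # global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 64
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson (SPD)
    x = cg(A, np.ones(n))
    assert np.linalg.norm(A @ x - np.ones(n)) < 1e-6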


                                                      96

[Figure: convergence of CA-CG (monomial basis) vs CG.
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400
• Slower convergence and loss of accuracy due to roundoff (plateau above machine precision)
• At s = 16 the monomial basis is rank deficient: the method breaks down]
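A hedged numpy reconstruction of the model problem: build the 30x30-grid 2D Poisson matrix and watch the conditioning of the normalized monomial basis [p, Ap, …, Aˢp] degrade as s grows, consistent with the breakdown at s = 16 reported above. The random seed and per-column normalization are illustrative choices.

    import numpy as np
    import scipy.sparse as sp

    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m), format="csr")
    A = sp.kron(sp.eye(m), T, format="csr") + sp.kron(T, sp.eye(m), format="csr")
    p = np.random.default_rng(0).standard_normal(m * m)

    for s in (4, 8, 16):
        V = np.empty((m * m, s + 1))
        V[:, 0] = p / np.linalg.norm(p)
        for k in range(s):
            V[:, k + 1] = A @ V[:, k]
            V[:, k + 1] /= np.linalg.norm(V[:, k + 1])  # scale, keep direction
        print(s, np.linalg.cond(V))  # grows rapidly toward 1/eps with s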

                                                      97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Nonzero entries \ Indices    Explicit (O(nnz))       Implicit (o(nnz))
Explicit (O(nnz))            CSR and variations      Vision, climate, AMR, …
Implicit (o(nnz))            Graph Laplacian         Stencils
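A small sketch of the "sum of sparse matrices" idea, with made-up sizes and random data: apply A = S + U·D·Vᵀ (and hence Aᵏ, by repeated application) to a vector without ever forming the dense A.

    import numpy as np
    import scipy.sparse as sp

    def make_apply(S, U, D, V):
        def apply_A(x):
            # O(nnz(S) + n*rank) work; the dense n x n matrix never exists
            return S @ x + U @ (D @ (V.T @ x))
        return apply_A

    n, r = 2000, 5
    S = sp.random(n, n, density=0.001, format="csr")
    U, V = np.random.rand(n, r), np.random.rand(n, r)
    D = np.diag(np.random.rand(r))
    A_mul = make_apply(S, U, D, V)

    y = np.random.rand(n)
    for _ in range(3):        # A^3 y via repeated application
        y = A_mul(y)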

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                      101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

                                                      Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum − minimum
• Relative error = absolute error / maximum absolute value
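The same effect is easy to reproduce without MKL: summing identical data in different orders (as different thread counts would) changes the low-order bits. A minimal numpy demo, with arbitrary seed and chunk count:

    import numpy as np

    x = np.random.default_rng(42).standard_normal(10**6)

    s1 = np.add.reduce(x)                                  # one "thread"
    s2 = sum(np.add.reduce(c) for c in np.split(x, 4))     # four "threads", chunked
    print(s1 == s2, s1 - s2)   # typically False, with a tiny nonzero difference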

                                                      103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
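A toy sketch of the prerounding idea, under simplifying assumptions (fewer than 2^13 summands, no overflow/Inf/NaN handling, accuracy knob t picked arbitrarily); the real algorithm (Nguyen, D.) is more careful. Rounding every summand to a common power-of-two bin makes the truncated values add exactly, hence identically in any order.

    import math
    import numpy as np

    def reproducible_sum(x, t=40):
        M = np.max(np.abs(x))
        if M == 0.0:
            return 0.0
        delta = math.ldexp(1.0, math.frexp(M)[1] - t)  # bin width 2^(exp(M)-t)
        q = np.round(x / delta)       # |q_i| <= 2^t; exact, since delta is 2^k
        return delta * q.sum()        # integer-valued partial sums stay exact

Every partial sum of the q_i is an integer below 2^53, so it is represented exactly and the result is independent of summation order; larger t trades reproducible accuracy against the headroom available for n.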

                                                      104

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                      Summary

Don't Communic…

                                                      106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


CARMA Performance: Distributed Memory

[Figure: performance on log-log axes for an inner-product-shaped matmul, m = n = 192, k = 6,291,456; curves: ScaLAPACK, CARMA, Peak.]

Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: performance (linear) vs. problem size (log) for square matmul, m = k = n; curves: MKL (double), CARMA (double), MKL (single), CARMA (single), Peak (single), Peak (double).]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: performance (linear) vs. k (log) for an inner-product-shaped matmul, m = n = 64; curves: MKL (double), CARMA (double), MKL (single), CARMA (single).]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: L3 cache misses (linear scale) for the shared-memory inner product (m = n = 64, k = 524,288): CARMA incurs 97% fewer misses and 86% fewer misses than MKL in the two cases shown.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

35

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  • #words_moved = O(n^3)

• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  • #words_moved = O(n^3 / M^(1/3))

• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words_moved = O(n^3 / M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
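For concreteness, a numpy sketch of the recursive approach above. It omits pivoting, so it is only safe on matrices where that is acceptable (e.g. diagonally dominant ones); sizes are illustrative.

    import numpy as np

    def recursive_lu(A):
        """In-place LU of a tall panel A (unit-L and U share storage)."""
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]              # one column: scale below diagonal
            return
        k = n // 2
        recursive_lu(A[:, :k])               # factor left half
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])   # update right half: U12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]            # Schur complement
        recursive_lu(A[k:, k:])              # factor right half

    n = 8
    A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant test matrix
    LU = A.copy(); recursive_lu(LU)
    L = np.tril(LU, -1) + np.eye(n); U = np.triu(LU)
    print(np.allclose(L @ U, A))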

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on W = [W0; W1; W2; W3].
• Parallel (binary tree): local QRs of W0..W3 produce R00, R10, R20, R30; pairwise combines produce R01, R11; a final combine produces R02.
• Sequential/streaming (flat tree): QR of W0 gives R00; folding in W1 gives R01, then W2 gives R02, then W3 gives R03.
• Dual core: a hybrid of the two trees.]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
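A numpy sketch of the parallel (binary-tree) variant, with four row blocks standing in for four processors; only R factors travel up the tree. Block counts and sizes are illustrative.

    import numpy as np

    def tsqr_R(blocks):
        Rs = [np.linalg.qr(W)[1] for W in blocks]    # local QR on each W_i
        while len(Rs) > 1:                           # combine pairs up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(4000, 50)
    R_tree = tsqr_R(np.split(W, 4))                  # W0..W3 as in the figure
    R_ref = np.linalg.qr(W)[1]
    # R is unique up to row signs, so compare magnitudes
    print(np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-8))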

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "tournament pivoting"

W (n x b) = [W1; W2; W3; W4]
• Factor each block: W1 = P1·L1·U1, …, W4 = P4·L4·U4; choose b pivot rows of each Wi, call them Wi'
• [W1'; W2'] = P12·L12·U12: choose b pivot rows, call them W12'
  [W3'; W4'] = P34·L34·U34: choose b pivot rows, call them W34'
• [W12'; W34'] = P1234·L1234·U1234: choose b pivot rows
• Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

                                                        37
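A hedged numpy/scipy sketch of the tournament: each local contest uses partially pivoted LU (scipy's lu_factor) to pick its b candidate rows, and winners play off pairwise up the reduction tree. Block counts and sizes are made up for the demo.

    import numpy as np
    from scipy.linalg import lu_factor

    def local_pivot_rows(W):
        b = W.shape[1]
        _, piv = lu_factor(W)            # LAPACK swap list from partial pivoting
        perm = np.arange(W.shape[0])
        for i, p in enumerate(piv):      # replay swaps to see which rows end on top
            perm[i], perm[p] = perm[p], perm[i]
        return perm[:b]

    def tournament_pivot(W, nblocks):
        # round 0: each block proposes its b best rows (as global indices)
        winners = [idx[local_pivot_rows(W[idx])]
                   for idx in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(winners) > 1:          # pairwise playoffs up the tree
            winners = [
                (lambda idx: idx[local_pivot_rows(W[idx])])(np.concatenate(winners[i:i + 2]))
                for i in range(0, len(winners), 2)]
        return winners[0]                # b pivot rows: move to top, LU w/o pivoting

    W = np.random.rand(1024, 8)
    print(tournament_pivot(W, 4))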

Minimizing Communication in TSLU

[Figure: the same reduction trees as TSQR on W = [W1; W2; W3; W4], with LU at each node in place of QR: parallel (binary tree), sequential/streaming (flat tree), and dual-core hybrid.]

Can choose reduction tree dynamically, to match architecture, as before

                                                        38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

                                                        39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
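The Matlab experiment translates directly to numpy/scipy. The sketch below (illustrative seed, bare Doolittle elimination for the unpivoted LU) reproduces the same zoo of Infs, NaNs, and O(1) differences; note scipy's convention A = P·L·U, so the unpivoted factorization runs on Pᵀ·A.

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot(A):
        A = A.astype(float).copy(); n = len(A)
        for k in range(n - 1):                    # Doolittle, no row exchanges:
            A[k+1:, k] /= A[k, k]                 # may divide by ~0 -> Inf/NaN
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return np.tril(A, -1) + np.eye(n)

    rng = np.random.default_rng(1)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, U = lu(A)                           # partial pivoting
        Lnp = lu_nopivot(P.T @ A)                 # same matrix, no pivoting
        diffs.append(np.max(np.abs(L - Lnp)))
    print(np.nanmax(diffs),
          sum(bool(np.isnan(d) or np.isinf(d)) for d in diffs))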

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

                                                        42

                                                        2D CALU with Tournament Pivoting

                                                        43

2.5D CALU with Tournament Pivoting (c = 4 copies)

                                                        44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup as a function of log2(p) (x-axis) and log2(n^2/p) = log2(memory_per_proc) (y-axis); up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

48

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ, where T is banded, using TSLU
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• #Words = O(n^3 / M^(1/2))
• #Messages = O(n^3 / M)

vs.

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
• #Words = O(n^3 / M^(1/2))
• #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use tournament pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

52

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
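A quick numpy check of the recursion, with ⊗ implemented as "min of D and the min-plus product," exactly as defined above; graph size and density are arbitrary.

    import numpy as np

    def minplus(D, A, B):
        # D = A (x) B: elementwise min of D with the (min, +) product of A, B
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def floyd_warshall(A):
        D = A.copy()
        for k in range(len(D)):
            D = np.minimum(D, D[:, k, None] + D[None, k, :])
        return D

    def dc_apsp(A):
        n = len(A)
        if n == 1:
            return A
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    n = 16
    A = np.where(np.random.rand(n, n) < 0.3, np.random.rand(n, n), np.inf)
    np.fill_diagonal(A, 0)
    print(np.allclose(floyd_warshall(A), dc_apsp(A)))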

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); callouts: 6.2x speedup, 2x speedup.]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

                                                        54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n / (P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar):
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-avoiding approach:
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: animation of bulge chasing on a banded symmetric matrix. Orthogonal transforms Q1, Q1ᵀ, Q2, Q2ᵀ, …, Q5, Q5ᵀ eliminate c columns at a time, creating bulges (steps 1–6) that are chased down the band. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA - SBR

Conventional                  Communication-Avoiding
Touch all data 4 times        Touch all data once

[Videos of the two sweep patterns omitted.]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
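As a hedged illustration of one split step: the sketch below builds the spectral projector for eigenvalues with Re(λ) > σ via the classical Newton iteration for the matrix sign function, then takes an orthogonal basis of its range. The fixed shift σ and the plain randomized range-finder stand in for the method's randomized shift selection and randomized RRQR, so treat every name here as illustrative, not as the actual algorithm.

    import numpy as np

    def split_spectrum(A, sigma=0.0, iters=60):
        # Newton iteration for sign(A - sigma*I): X <- (X + inv(X)) / 2.
        # Assumes no eigenvalue lies (numerically) on the line Re(z) = sigma.
        n = A.shape[0]
        X = A - sigma * np.eye(n)
        for _ in range(iters):
            X = 0.5 * (X + np.linalg.inv(X))
        P = 0.5 * (np.eye(n) + X)            # spectral projector, rank k
        k = int(round(np.trace(P)))          # dimension of invariant subspace
        # Randomized range-finder: leading k columns of Q span range(P)
        G = np.random.default_rng(0).standard_normal((n, n))
        Q, _ = np.linalg.qr(P @ G)
        B = Q.T @ A @ Q                      # block upper triangular, up to eps
        return Q, k, B

    A = np.random.default_rng(1).standard_normal((8, 8))
    Q, k, B = split_spectrum(A)
    print(k, np.linalg.norm(B[k:, :k]))      # ||A21|| should be tiny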

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #words and #messages, for two levels of memory and for a full memory hierarchy.

  BLAS-3:            [FLPR'99] [BDLST'13] [MKL etc.] (all four columns)
  Cholesky:          [G'97] [AP'00] [LAPACK] [BDHS'09] | [G'97] [AP'00] [BDHS'09] (remaining columns)
  Sym. Indefinite:   [BBDDDPSTY'13] (words and messages)
  LU:                [G'97] [T'97] [GDX'11] [BDLST'13] | [GDX'11] [BDLST'13] | [G'97] [T'97] [BDLST'13] | [BDLST'13]
  QR:                [EG'98] [FW'03] [DGHL'12] [BDLST'13] | [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] | [FW'03] [BDLST'13]
  Rank-Revealing QR: [BDD'11] [DGGX'13]
  Sym. Eig & SVD:    [BDD'11] [BDK'13] | [BDD'11]
  Nonsym. Eig:       [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = n²/P
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

                       Words (BW)                                   Messages (L)             Saving factor
  BLAS-3:              [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]                  L: n/P^(1/2)
  Cholesky:            [ScaLAPACK] [T'99] [SD'11]                                            L: n/P^(1/2)
  Sym. Indefinite:     [BBDDDPSTY'13] [ScaLAPACK]          [BBDDDPSTY'13]                    L: n/P^(1/2)
  LU:                  [ScaLAPACK] [GDX'11] [T'99] [SD'11] [GDX'11] [T'99] [SD'11]           L: n/P^(1/2)
  QR:                  [ScaLAPACK] [DGHL'12] [T'99]        [DGHL'12] [T'99]                  L: n/P^(1/2)
  Rank-Revealing QR:   [BDD'11] [DGGX'13]
  Sym. Eig & SVD:      [BDD'11] [BDK'13] [ScaLAPACK]       [BDD'11] [BDK'13]                 L: n/P^(1/2)
  Nonsym. Eig:         [BDD'11]                            [BDD'11]                          BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured); see the Krylov-basis sketch below
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                        75
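To make the object being reorganized concrete, here is a hedged sketch of the naive kernel that CA methods replace: computing the Krylov basis [x, Ax, ..., A^k x] with k separate SpMVs, so A is read k times. A real CA matrix powers kernel produces the same basis while reading A (plus ghost data from neighbors) only once, which this sketch does not attempt.

    import numpy as np
    from scipy.sparse import random as sprandom

    def krylov_basis(A, x, k):
        # Conventional kernel: k SpMVs, hence k passes over A
        # (O(k) slow-memory traffic; O(k log p) messages in parallel).
        V = np.empty((x.size, k + 1))
        V[:, 0] = x
        for j in range(k):
            V[:, j + 1] = A @ V[:, j]
        return V

    A = sprandom(1000, 1000, density=0.01, format="csr", random_state=0)
    V = krylov_basis(A, np.ones(1000), 8)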

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Plot: reference implementation vs best register blocking (4x2), both in Mflops]

79

Register Profile: Itanium 2

[Heat map over register block sizes: 190 Mflops (worst) to 1190 Mflops (best)]

80

Register Profiles: IBM and Intel IA-64

[Four register-profile heat maps (panel labels as in the original: Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33), with worst-to-best rates: Power3: 122 to 252 Mflops; Power4: 459 to 820 Mflops; Itanium 1: 107 to 247 Mflops; Itanium 2: 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5² = 2.25x higher
  (see the Block-CSR sketch below)

                                                        85
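A hedged sketch of the same trade in SciPy: converting CSR to 3x3 Block CSR stores each occupied tile densely, filling in explicit zeros, and the fill ratio is stored entries over true nonzeros. The matrix here is random, purely for illustration.

    import numpy as np
    from scipy.sparse import random as sprandom

    A = sprandom(900, 900, density=0.01, format="csr", random_state=0)
    B = A.tobsr(blocksize=(3, 3))      # 3x3 Block CSR, explicit zeros filled in

    fill_ratio = B.data.size / A.nnz   # stored entries / true nonzeros
    print(f"fill ratio = {fill_ratio:.2f}")

    x = np.ones(900)
    y = B @ x                          # SpMV now runs on dense 3x3 tiles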

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                        86

                                                        100x100 Submatrix Along Diagonal

87

                                                        Post-RCM Reordering

                                                        88

                                                        Effect of Combined RCM+TSP Reordering

[Spy plots: before = green + red; after = green + blue]

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                        90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                        91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                        93

Example: Classical Conjugate Gradient (CG)

  SpMVs and dot products require communication in each iteration
  (algorithm figure lost in extraction; see the sketch below)

94

Example: CA-Conjugate Gradient

  The s-step basis is computed via the CA Matrix Powers Kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication
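Since the algorithm images did not survive extraction, here is a minimal classical CG in NumPy for reference, with the per-iteration communication points marked (one SpMV plus two dot products, each dot product a global reduction in parallel). CA-CG reorganizes s such iterations into one matrix-powers call and one block reduction.

    import numpy as np

    def cg(A, b, tol=1e-10, maxiter=1000):
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rr = r @ r                    # dot product (global reduction)
        for _ in range(maxiter):
            Ap = A @ p                # SpMV (neighbor communication)
            alpha = rr / (p @ Ap)     # dot product (global reduction)
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r            # dot product (global reduction)
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x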

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                        96

[Convergence plot: CA-CG (monomial basis) vs CG. Slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; dashed line marks machine precision.]

                                                        97
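A hedged demo of why the monomial basis breaks down: its condition number grows roughly exponentially in s, reaching 1/ε well before it hits s = 16 on the model problem above (built here with standard Kronecker products; the unnormalized basis is exactly what the naive approach computes).

    import numpy as np

    # 2D Poisson, 5-point stencil, on an n x n grid (the model problem above)
    n = 30
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

    x = np.random.default_rng(0).standard_normal(n * n)
    V = [x / np.linalg.norm(x)]
    for s in range(1, 17):
        V.append(A @ V[-1])           # monomial basis: x, Ax, A^2 x, ...
        print(s, f"cond = {np.linalg.cond(np.column_stack(V)):.1e}")
        # once cond reaches ~1/eps, the basis is numerically rank deficient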

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices (see the sketch below)

Taxonomy:
                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzero entries explicit    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit    Graph Laplacian              Stencils
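A hedged sketch of why the S + U·D·Vᵀ form is worth preserving: applying A to a vector costs O(nnz(S) + n·r) and never densifies A. All names and sizes here are illustrative.

    import numpy as np
    from scipy.sparse import random as sprandom

    rng = np.random.default_rng(0)
    n, r = 1000, 5
    S = sprandom(n, n, density=0.01, format="csr", random_state=0)  # sparse part
    U = rng.standard_normal((n, r))                                 # low-rank part
    D = np.diag(rng.standard_normal(r))                             # r x r, "small & square"
    V = rng.standard_normal((n, r))

    def apply_A(x):
        # A @ x = S @ x + U @ (D @ (V.T @ x)); O(nnz(S) + n*r) work
        return S @ x + U @ (D @ (V.T @ x))

    y = apply_A(np.ones(n))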

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                        101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]

Experiment: vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

                                                        103
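The cause is easy to reproduce anywhere: floating-point addition is not associative, so a different reduction order (a different thread count, say) can change the computed bits.

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0: same inputs, different grouping, different bits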

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.) – see the sketch below

                                                        104

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
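A hedged, single-bin sketch of the prerounding idea: round every summand to a common power-of-two boundary chosen from max|xᵢ| and n, after which every addition is exact and the sum is bit-identical in any order. The real Demmel/Nguyen scheme uses several bins to preserve accuracy and avoids the extra passes; this simplification is for illustration only (finite inputs assumed).

    import numpy as np

    def reproducible_sum(x):
        n = x.size
        M = np.max(np.abs(x))
        if M == 0.0:
            return 0.0
        # Power-of-two boundary large enough that n prerounded terms cannot
        # leave the range where doubles are exact multiples of its ulp.
        boundary = 2.0 ** (np.ceil(np.log2(M)) + np.ceil(np.log2(n)))
        rounded = (boundary + x) - boundary   # round each x_i to a multiple of ulp(boundary)
        return float(np.sum(rounded))         # every addition is now exact

    x = np.random.default_rng(0).standard_normal(10**6)
    assert reproducible_sum(x) == reproducible_sum(x[::-1].copy())  # any order, same bits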

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


CARMA Performance: Shared Memory

[Plot: square case (m = k = n); MKL vs CARMA, single and double precision, with single/double peak lines; axes marked (log) and (linear). Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA]

CARMA Performance: Shared Memory

[Plot: inner-product case (m = n = 64); MKL vs CARMA, single and double precision; axes marked (log) and (linear). Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Plot: shared-memory inner product (m = n = 64, k = 524,288); 97% and 86% fewer misses than MKL; linear axis]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

35

• Classical approach:
      for i = 1 to n
          update column i
          update trailing matrix
  #words_moved = O(n³)
• Blocked approach (LAPACK):
      for i = 1 to n/b
          update block i of b columns
          update trailing matrix
  #words_moved = O(n³/M^(1/3))
• Recursive approach:
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              update right half of A
              factor(right half of A)
  #words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: W = [W0; W1; W2; W3] reduced to one R by a tree of small local QRs, with only R factors flowing up:
  Parallel (binary tree): (W0, W1, W2, W3) -> (R00, R10, R20, R30) -> (R01, R11) -> R02
  Sequential/streaming (flat tree): W0 -> R00; (R00, W1) -> R01; (R01, W2) -> R02; (R02, W3) -> R03
  Dual core (hybrid tree): a combination of the two]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
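A hedged NumPy sketch of the parallel (binary-tree) variant for a tall-skinny W: leaves factor their blocks, then pairs of b x b R factors are stacked and re-factored until one R remains. The number of blocks p is assumed to be a power of two, and the Q factors, which a real TSQR keeps implicitly, are simply dropped here.

    import numpy as np

    def tsqr_R(W, p=4):
        blocks = np.array_split(W, p)                   # W0, ..., W_{p-1}
        Rs = [np.linalg.qr(Wi)[1] for Wi in blocks]     # leaf QRs (in parallel)
        while len(Rs) > 1:                              # pairwise combine:
            Rs = [np.linalg.qr(np.vstack(pair))[1]      # QR of a stacked 2b x b
                  for pair in zip(Rs[0::2], Rs[1::2])]  # only R's are "sent"
        return Rs[0]

    W = np.random.default_rng(0).standard_normal((4096, 8))
    R = tsqr_R(W)
    R_ref = np.linalg.qr(W)[1]
    print(np.allclose(np.abs(R), np.abs(R_ref)))        # same R, up to column signs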

Back to LU: using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

  W(n x b) = [W1; W2; W3; W4], with W1 = P1·L1·U1, W2 = P2·L2·U2, W3 = P3·L3·U3, W4 = P4·L4·U4
      Choose b pivot rows of W1, call them W1'
      Choose b pivot rows of W2, call them W2'
      Choose b pivot rows of W3, call them W3'
      Choose b pivot rows of W4, call them W4'

  [W1'; W2'] = P12·L12·U12  ->  choose b pivot rows, call them W12'
  [W3'; W4'] = P34·L34·U34  ->  choose b pivot rows, call them W34'

  [W12'; W34'] = P1234·L1234·U1234  ->  choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting).

                                                          37
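A hedged SciPy sketch of the tournament: GEPP on each stacked pair picks b winning rows, whose indices are mapped back into W. A power-of-two number of blocks and n >= p·b are assumed; scipy.linalg.lu returns P with A = P·L·U, so the rows P sends to the top positions are the local pivot rows.

    import numpy as np
    from scipy.linalg import lu

    def tournament_pivot_rows(W, b, p=4):
        groups = np.array_split(np.arange(W.shape[0]), p)
        cand = []
        for idx in groups:                              # leaf round: local GEPP
            P, _, _ = lu(W[idx])                        # W[idx] = P @ L @ U
            cand.append(idx[np.argmax(P, axis=0)[:b]])  # rows moved to the top
        while len(cand) > 1:                            # pairwise playoffs
            nxt = []
            for i, j in zip(cand[0::2], cand[1::2]):
                idx = np.concatenate([i, j])            # 2b candidate rows
                P, _, _ = lu(W[idx])
                nxt.append(idx[np.argmax(P, axis=0)[:b]])
            cand = nxt
        return cand[0]                                  # the b winners: pivot rows of W

    W = np.random.default_rng(0).standard_normal((64, 4))
    print(tournament_pivot_rows(W, b=4))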

Minimizing Communication in TSLU

[Figure: the same three reduction trees as for TSQR, with local LU in place of local QR: parallel (binary tree), sequential/streaming (flat tree), dual core (hybrid)]

Can choose reduction tree dynamically to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment (reproduced in the sketch below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different orders gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
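A hedged Python reconstruction of that experiment. Unpivoted LU is written out by hand, since SciPy's lu always pivots; scipy.linalg.lu returns P with A = P·L·U, so the GEPP row order is Pᵀ·A.

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        L, U = np.eye(n), np.zeros((n, n))
        for k in range(n):
            U[k, k:] = A[k, k:]
            L[k+1:, k] = A[k+1:, k] / U[k, k]   # inf/NaN on (near-)zero pivots
            A[k+1:, k+1:] -= np.outer(L[k+1:, k], U[k, k+1:])
        return L, U

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, _ = lu(A)                     # GEPP: A = P @ L @ U
        Lnp, _ = lu_nopivot(P.T @ A)        # same row order, no pivoting
        diffs.append(np.linalg.norm(L - Lnp))
    d = np.array(diffs)
    print(np.sum(~np.isfinite(d)), "inf/NaN;", np.sum(np.isfinite(d)), "finite")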

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map over log2(p) (horizontal axis) and log2(n²/p) = log2(memory_per_proc) (vertical axis); up to 29x predicted speedup]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

48

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
      [Diagram: the banded, mostly-zero structure of T]
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ, where T is banded, using TSLU
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

  Recursive LU:
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              update right half of A
              factor(right half of A)
  #words = O(n³/M^(1/2)); #messages = O(n³/M)

  Shape Morphing LU (SMLU):
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              reshape to recursive block format
              update right half of A
              reshape to columnwise format
              factor(right half of A)
  #words = O(n³/M^(1/2)); #messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then:

      for k = 1 to n
          for i = 1 to n
              for j = 1 to n
                  D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies OK, 2.5D works, just a different semiring (sketch below)
• Kleene's Algorithm:

52

      D = DC-APSP(A, n):
          D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
          D11 = DC-APSP(D11, n/2)
          D12 = D11 ⊗ D12
          D21 = D21 ⊗ D11
          D22 = D21 ⊗ D12
          D22 = DC-APSP(D22, n/2)
          D21 = D22 ⊗ D21
          D12 = D12 ⊗ D22
          D11 = D12 ⊗ D21
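A hedged sketch of the semiring building block, i.e. the D = A⊗B abbreviation above: a min-plus "matmul" that accumulates into D. It has the same data dependencies as C += A·B, which is why the 2.5D schedule carries over.

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): (min, +) semiring matmul
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    INF = np.inf
    A = np.array([[0.0, 3.0, INF],
                  [INF, 0.0, 1.0],
                  [2.0, INF, 0.0]])
    D = minplus(A.copy(), A, A)   # shortest paths using at most 2 edges
    print(D)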

Performance of 2.5D APSP using Kleene

53

[Plot: strong scaling on Hopper (Cray XE6, 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for a 2D grid, 3D grid, and similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: the assignment of data and work to processors is independent of the sparsity pattern (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      #Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                          Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are the eigenvectors, Λ's entries the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-avoiding approach:
  – A → QAQ^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
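For reference, the conventional pipeline in the first bullet is easy to sketch with SciPy (our illustration, not the talk's code; the Hessenberg form of a symmetric matrix is tridiagonal, so hessenberg stands in for LAPACK's sytrd):

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal

    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 300))
    A = (A + A.T) / 2                   # symmetric test matrix

    # Phase 1: Dense -> Tridiagonal (the communication-bound step).
    T, Q = hessenberg(A, calc_q=True)   # A = Q T Q^T; T tridiagonal since A = A^T
    d, e = np.diag(T), np.diag(T, 1)

    # Phase 2: Tridiagonal -> Diagonal.
    lam, U = eigh_tridiagonal(d, e)

    V = Q @ U                           # eigenvectors of A; lam holds eigenvalues
    assert np.allclose(A @ V, V * lam, atol=1e-10)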

[Figure: Successive Band Reduction (Bischof/Lang/Sun), animated over several slides. A QR on a panel of c columns (Q1, applied as Q1 and Q1^T on both sides) creates a bulge of d diagonals; further orthogonal sweeps Q2, Q3, Q4, Q5 chase the numbered bulges (1 through 6) down the band. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

                                                          Conventional vs CA - SBR

  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

                                                          Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        Q^T A Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized rank-revealing QR decomposition
    2. Randomized location to try splitting the spectrum
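For illustration only, a classical (inverse-based, non-randomized) variant of one split step can be written via the matrix sign function; the function name, iteration count, and use of explicit inverses below are our choices, not the talk's inverse-free randomized algorithm:

    import numpy as np
    from scipy.linalg import qr

    def split_spectrum(A, shift=0.0, iters=40):
        # One divide step: split the spectrum of A at Re(lambda) = shift.
        # Assumes no eigenvalues on (or very near) that vertical line.
        n = A.shape[0]
        X = A - shift * np.eye(n)
        for _ in range(iters):              # Newton: X -> sign(A - shift*I)
            X = 0.5 * (X + np.linalg.inv(X))
        P = 0.5 * (np.eye(n) + X)           # spectral projector, Re(lambda) > shift
        k = int(round(np.trace(P)))         # dimension of that invariant subspace
        Q, _, _ = qr(P, pivoting=True)      # leading k columns of Q span range(P)
        B = Q.T @ A @ Q                     # block upper triangular: B[k:, :k] ~ eps
        return Q, B, k

    # Recurse on B[:k, :k] and B[k:, k:]; B[k:, :k] is the epsilon block above.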

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (#Words, #Messages) | Memory Hierarchy (#Words, #Messages)

  BLAS-3:            [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.]
  Cholesky:          [G'97] [AP'00] [LAPACK] [BDHS'09] | [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09]
  Sym. Indefinite:   [BBDDDPSTY'13] | [BBDDDPSTY'13]
  LU:                [G'97] [T'97] [GDX'11] [BDLST'13] | [GDX'11] [BDLST'13] | [G'97] [T'97] [BDLST'13] | [BDLST'13]
  QR:                [EG'98] [FW'03] [DGHL'12] [BDLST'13] | [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] | [FW'03] [BDLST'13]
  Rank-Revealing QR: [BDD'11] [DGGX'13]
  Sym. Eig & SVD:    [BDD'11] [BDK'13] | [BDD'11]
  Non-Sym. Eig:      [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

  BLAS-3:            [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving factor L: n/P^(1/2)
  Cholesky:          [ScaLAPACK] [T'99] [SD'11]; L: n/P^(1/2)
  Sym. Indefinite:   [BBDDDPSTY'13] [ScaLAPACK] | [BBDDDPSTY'13]; L: n/P^(1/2)
  LU:                [ScaLAPACK] [GDX'11] [T'99] [SD'11] | [GDX'11] [T'99] [SD'11]; L: n/P^(1/2)
  QR:                [ScaLAPACK] [DGHL'12] [T'99] | [DGHL'12] [T'99]; L: n/P^(1/2)
  Rank-Revealing QR: [BDD'11] [DGGX'13]
  Sym. Eig & SVD:    [BDD'11] [BDK'13] [ScaLAPACK] | [BDD'11] [BDK'13]; L: n/P^(1/2)
  Non-Sym. Eig:      [BDD'11] | [BDD'11]; BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                          Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data, which is optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages, which is optimal
• Lots of speedup possible (modeled and measured); see the sketch below
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
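The starting point is what those k SpMVs produce: the Krylov basis. A naive sketch (names ours, assuming scipy.sparse) makes the target concrete; this version still makes k passes over A, whereas the CA matrix powers kernel returns the same vectors with O(1) passes (serial) or O(log p) messages (parallel) by giving each partition redundant "ghost" rows:

    import numpy as np
    import scipy.sparse as sp

    def monomial_basis(A, x, s):
        # V[:, j] = A^j x, j = 0..s: the vectors s steps of CG/GMRES/Lanczos touch.
        V = np.empty((x.size, s + 1))
        V[:, 0] = x
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]   # one SpMV per column: s passes over A
        return V

    # Example: 1D Poisson (tridiagonal) matrix, s = 4
    n, s = 1000, 4
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
    V = monomial_basis(A, np.ones(n), s)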


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)


Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

• 8x8 dense substructure: exploit this to limit #mem_refs


Speedups on Itanium 2: The Need for Search

[Figure: SpMV register-profile pair on Itanium 2, in Mflops: the reference (unblocked) code vs the best block size found by search (4x2).]


Register Profile: Itanium 2

[Figure: the register profile spans 190 Mflops (worst) to 1190 Mflops (best).]


Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles on four machines; Power3 (17%): 122 → 252 Mflops; Power4 (16%): 459 → 820 Mflops; Itanium 1 (8%): 107 → 247 Mflops; Itanium 2 (33%): 190 Mflops → 1.2 Gflops.]

                                                          Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M


                                                          Zoom in to top corner

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M


3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"


                                                          Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5^2 = 2.25x higher
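The fill/speed trade-off is easy to measure with SciPy, whose blocked (BSR) format stores whole r x c blocks including the explicit zeros (a toy measurement, ours, assuming scipy.sparse):

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    A = sp.random(900, 900, density=0.01, format='csr', random_state=rng)

    B = A.tobsr(blocksize=(3, 3))  # register-blocked storage: whole 3x3 blocks
    fill = B.nnz / A.nnz           # BSR nnz counts the explicit zeros filled in
    print(f"fill ratio = {fill:.2f}")
    # Blocking wins only when (blocked Mflop rate / CSR Mflop rate) > fill;
    # the slide's 1.5x speedup at fill 1.5 means the blocked kernel itself
    # ran 1.5^2 = 2.25x faster.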


Source: Accelerator Cavity Design Problem (Ko, via Husbands)


                                                          100x100 Submatrix Along Diagonal


                                                          Post-RCM Reordering


                                                          Effect of Combined RCM+TSP Reordering

Before: green + red; After: green + blue


2x speedups on Pentium 4, Power 4, …

                                                          Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…


                                                          Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Example: Classical Conjugate Gradient (CG)

[Figure: classical CG pseudocode. The SpMV and the dot products require communication in each iteration.]
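A bare-bones CG (our sketch; serial NumPy, with the parallel cost of each step noted in comments) shows exactly where those communication points sit:

    import numpy as np

    def cg(A, b, tol=1e-8, maxit=1000):
        x = np.zeros_like(b)
        r = b.copy()                      # r0 = b - A x0, with x0 = 0
        p = r.copy()
        rr = r @ r                        # dot product -> global reduction
        for _ in range(maxit):
            Ap = A @ p                    # SpMV -> neighbor communication
            alpha = rr / (p @ Ap)         # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                # dot product -> global reduction
            if np.sqrt(rr_new) <= tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

CA-CG, next, replaces s of these iterations by one matrix powers call plus one block reduction (the Gram matrix G), cutting communication per s steps from O(s) to O(1).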

Example: CA-Conjugate Gradient
[Figure: CA-CG pseudocode. The s SpMVs are replaced by one call to the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


[Figure: convergence of CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG (monomial) converges more slowly due to roundoff, and loses accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square

• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices (see the sketch below)
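The access pattern this implies is simple to sketch (assuming scipy.sparse; sizes and names below are ours): A = S + UDV^T is only ever applied, never formed:

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(1)
    n, r = 100_000, 5
    S = sp.random(n, n, density=1e-5, format='csr', random_state=rng)
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((n, r))
    d = rng.standard_normal(r)            # D = diag(d), small & square

    def apply_A(x):
        # y = (S + U D V^T) x in O(nnz(S) + n*r) time; never densify A
        return S @ x + U @ (d * (V.T @ x))

    y = apply_A(np.ones(n))
    # Repeated application gives the Krylov vectors A^k x; expanding
    # (S + U D V^T)^k into S^k plus low-rank terms is what lets a CA
    # matrix powers kernel handle this kind of "sparse matrix".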

Nonzero entries \ Indices         Explicit (O(nnz))        Implicit (o(nnz))
  Explicit (O(nnz))               CSR and variations       Vision, climate, AMR, …
  Implicit (o(nnz))               Graph Laplacian          Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers): otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors of the same magnitude and opposite signs, and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads. Absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. For the orthogonal vectors, even the sign is not reproducible.]


Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.), sketched below
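A one-level toy of the prerounding idea (our sketch; the published algorithm of Nguyen and Demmel uses several bins to preserve more accuracy and handles scaling more carefully) shows why the answer becomes order-independent:

    import numpy as np

    def reproducible_sum(x):
        # Round every summand to a multiple of one shared ulp; after that,
        # every partial sum is exact, so ANY summation order / reduction
        # tree / thread count returns bitwise-identical results.
        x = np.asarray(x, dtype=np.float64)
        n = x.size
        # Power-of-2 boundary M with 2*n*max|x_i| <= M, so no partial sum
        # can leave the exactly representable range.
        M = 2.0 ** np.ceil(np.log2(2.0 * n * np.abs(x).max() + 1.0))
        t = (x + M) - M      # rounds each x_i to a multiple of ulp(M)
        return t.sum()       # exact and order-independent; error <= n*ulp(M)/2

    x = np.random.default_rng(3).standard_normal(10**6)
    assert reproducible_sum(x) == reproducible_sum(x[::-1])  # same bits
    print(reproducible_sum(x) - x.sum())                     # tiny, not zero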


Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                          Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                                                            CARMA Performance Shared Memory

[Figure: shared-memory matmul performance, inner-product shape (m = n = 64, k large): CARMA vs MKL, in single and double precision, with k on a log scale and performance on a linear scale. Machine: Intel Emerald, 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: L3 cache misses for the shared-memory inner product (m = n = 64, k = 524288), linear scale: CARMA incurs 97% and 86% fewer misses than MKL in the two precisions.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
      for i = 1 to n
          update column i
          update trailing matrix
  – #words_moved = O(n^3)


• Blocked approach (LAPACK):
      for i = 1 to n/b
          update block i of b columns
          update trailing matrix
  – #words_moved = O(n^3 / M^(1/3))

• Recursive approach:
      func factor(A)
          if A has 1 column, update it
          else
              factor(left half of A)
              update right half of A
              factor(right half of A)
  – #words_moved = O(n^3 / M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: partial pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

W = [W0; W1; W2; W3]
[Figure: three TSQR reduction trees.
  Parallel (binary tree): local QRs give R00, R10, R20, R30; pairs combine to R01 and R11; those combine to R02.
  Sequential/streaming (flat tree): R00 from W0, then R01, R02, R03 by folding in W1, W2, W3 one at a time.
  Dual core: a hybrid of the two.]
Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
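A serial mock-up of the parallel tree (ours, assuming NumPy) makes the pattern concrete: each leaf does a local QR, and only the small b x b R factors ever move up the tree:

    import numpy as np

    def tsqr_R(blocks):
        # Leaves: local QR on each row block (no communication).
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]
        # Tree: log2(p) rounds; each round stacks two small R's and re-QRs.
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.default_rng(0).standard_normal((4096, 8))
    R_tree = tsqr_R(np.array_split(W, 4))       # 4 "processors"
    R_flat = np.linalg.qr(W, mode='r')
    assert np.allclose(np.abs(R_tree), np.abs(R_flat))  # equal up to row signs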

Back to LU: using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

Step 1 (leaves): factor each block and pick b candidate pivot rows:
    W1 = P1·L1·U1 → choose b pivot rows of W1, call them W1'
    W2 = P2·L2·U2 → choose b pivot rows of W2, call them W2'
    W3 = P3·L3·U3 → choose b pivot rows of W3, call them W3'
    W4 = P4·L4·U4 → choose b pivot rows of W4, call them W4'

Step 2: play the winners against each other:
    [W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
    [W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

Step 3 (root):
    [W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
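A serial sketch of the tournament in numpy/scipy (illustrative names; assumes the number of rows of W is a multiple of 2b, and uses scipy's GEPP at each tree node):

    import numpy as np
    import scipy.linalg as sla

    def best_b_rows(W, rows, b):
        # One "match" of the tournament: GEPP on the stacked candidate rows;
        # keep the b rows that GEPP picks as its first b pivots.
        P, L, U = sla.lu(W[rows])                 # W[rows] = P @ L @ U
        return rows[P[:, :b].argmax(axis=0)]      # original index of each pivot row

    def tournament_pivot_rows(W, b):
        n = W.shape[0]
        cand = [best_b_rows(W, np.arange(i, i + 2*b), b) for i in range(0, n, 2*b)]
        while len(cand) > 1:                      # binary reduction tree
            cand = [best_b_rows(W, np.concatenate(cand[i:i+2]), b)
                    for i in range(0, len(cand), 2)]
        return cand[0]    # move these b rows to the top, then LU without pivoting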

                                                            37

                                                            Minimizing Communication in TSLU

[Figure: TSLU reduction trees, exactly analogous to TSQR, with a local LU at each node. W = [W1; W2; W3; W4]]

Parallel (binary tree): LU on each Wi, then LU on each pair of results, then LU at the root
Sequential/Streaming (flat tree): LU on W1, then LU on [result; W2], then [result; W3], then [result; W4]
Dual Core (hybrid tree)

Can choose reduction tree dynamically, to match architecture, as before

                                                            38

                                                            Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix, whose entries are blocks taken from the input A
• Why just a "Thm"?

                                                            39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment (a numpy version appears below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
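A numpy rendering of that experiment (a sketch of the Matlab original described above):

    import numpy as np
    import scipy.linalg as sla

    rng = np.random.default_rng(0)
    errs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, U = sla.lu(A)                  # GEPP: A = P @ L @ U
        M = P.T @ A                          # P·A: rows already in pivot order
        Lnp = np.eye(6)
        with np.errstate(divide='ignore', invalid='ignore'):
            for k in range(5):               # LU *without* pivoting on P·A
                Lnp[k+1:, k] = M[k+1:, k] / M[k, k]     # tiny pivots -> inf/NaN
                M[k+1:, k:] -= np.outer(Lnp[k+1:, k], M[k, k:])
        errs.append(np.max(np.abs(L - Lnp)))
    print(errs[:10])   # a few 0's, a few inf's, a few NaN's, the rest mostly O(1)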

                                                            Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare); a sketch follows below
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
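In code, the fall-back logic looks roughly like this (a sketch: tslu and tsqr stand for the kernels described earlier, and the thresholds are illustrative, not prescribed by the slide):

    import numpy as np

    def safe_panel_factor(A, tslu, tsqr, cond_max=1e8, growth_max=1e2):
        P, L, U = tslu(A)                              # fast path
        if np.linalg.cond(U) < cond_max and np.abs(L).max() < growth_max:
            return P, L, U                             # usual case: accept
        Q, R = tsqr(A)                                 # rare case: A = Q·R, stable
        P, L, U = tslu(Q)                              # Q = P·L·U; Q orthogonal, so safe
        return P, L, U @ R                             # A = P·L·(U·R); U·R upper triangular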

• Last topic in lecture: how to guarantee floating point reproducibility

                                                            42

                                                            2D CALU with Tournament Pivoting

                                                            43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                                            44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup; x-axis: log2(p), y-axis: log2(n^2/p) = log2(memory_per_proc). Up to 29x.]

2.5D vs 2D LU: With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·PT = L·D·LT, D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal, with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·PT = L·T·LT, where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

                                                            49

• Columnwise layout only:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  – #Words = O(n^3 / M^(1/2))
  – #Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  – #Words = O(n^3 / M^(1/2))
  – #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
        for i = 1:n
            for j = 1:n
                D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm (a numpy sketch follows below):

    D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 * D12
        D21 = D21 * D11
        D22 = D21 * D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 * D21
        D12 = D12 * D22
        D11 = D12 * D21

52
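A runnable numpy sketch of DC-APSP over the (min,+) semiring (serial; assumes n is a power of 2, np.inf marks missing edges, and the diagonal is 0):

    import numpy as np

    def minplus(C, A, B):
        # C(i,j) = min(C(i,j), min_k A(i,k) + B(k,j)): "matmul" in the (min,+) semiring
        return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
        dc_apsp(D11)                       # views: updates happen in place
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D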

Performance of 2.5D APSP using Kleene

53

[Figure: Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d^2·n / (P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                            Symmetric Eigenproblem and SVD

• Usual approach for A = AT (SVD similar)
  – A → QTAQ = T, where Q orthogonal, T tridiagonal
  – T → UTTU = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQT = B, where B = BT banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: sequence of band-matrix pictures. Each sweep Qi annihilates a block of c columns inside the band, creating a bulge of d+c diagonals; the two-sided updates Qi, QiT chase the bulge down the band, in numbered stages 1 through 6, with transformations Q1, Q1T, Q2, Q2T, ..., Q5, Q5T.]

b = bandwidth
c = #columns
d = #diagonals
Constraint: c + d ≤ b

Conventional vs CA-SBR

Conventional: touch all data 4 times
Communication-Avoiding: touch all data once


Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

                                                            Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QTAQ will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(#Words and #Messages, for two levels of memory and for a full memory hierarchy)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] (two levels); [FLPR'99][BDLST'13][MKL etc.] (hierarchy)
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] (words), [G'97][AP'00][BDHS'09] (messages); [G'97][AP'00][BDHS'09] (hierarchy)
• Sym Indefinite: [BBDDDPSTY'13]; [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] (words), [GDX'11][BDLST'13] (messages); [G'97][T'97][BDLST'13] (words), [BDLST'13] (messages)
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] (words), [FW'03][DGHL'12][BDLST'13] (messages); [EG'98][FW'03][BDLST'13] (words), [FW'03][BDLST'13] (messages)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13]; [BDD'11]
• Non Sym Eig: [BDD'11]; [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(#Words (BW), #Messages (L); saving factors attained with extra memory, 2.5D: M = Θ(c·n^2/P))

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] (words), [BBDDDPSTY'13] (messages); saving factor L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] (words), [GDX'11][T'99][SD'11] (messages); saving factor L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] (words), [DGHL'12][T'99] (messages); saving factor L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] (words), [BDD'11][BDK'13] (messages); saving factor L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] (words), [BDD'11] (messages); saving factors BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                            Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal (see the matrix powers sketch below)
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                            75
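The serial and parallel wins both come from the "matrix powers kernel": fetch the k layers of neighboring data once, then take k steps locally. A toy 1D illustration for a 3-point stencil (tridiagonal A; names invented here):

    import numpy as np

    def local_matrix_powers(x_ext, k, a=0.25):
        # x_ext: this processor's chunk of x extended by k "ghost" values on each
        # side -- one message per neighbor for all k steps, instead of one per SpMV.
        v = x_ext.astype(float).copy()
        for _ in range(k):
            v = a * (v[:-2] + v[2:]) + v[1:-1]   # one stencil step; valid region shrinks
        return v                                  # entries of (A^k x) on the owned rows

The price is the redundant work: the ghost entries near each boundary get recomputed by two processors, which is the "some redundant computation" noted above.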

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                            77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                            78

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heat map in Mflop/s; the Reference (unblocked) code vs the Best block size, 4x2]

79

Register Profile: Itanium 2

[Figure: performance of all register block sizes, from 190 Mflop/s (worst) to 1190 Mflop/s (best)]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile heat maps, each spanning worst to best block size.
  Power3 (17% of peak): 122 to 252 Mflop/s
  Power4 (16% of peak): 459 to 820 Mflop/s
  Itanium 1 (8% of peak): 107 to 247 Mflop/s
  Itanium 2 (33% of peak): 190 Mflop/s to 1.2 Gflop/s]

                                                            Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                            82

                                                            Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                            83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                            84

                                                            Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill-in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher (see the scipy sketch below)
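scipy's BSR format performs exactly this zero-filling, so the trade-off is easy to inspect (a sketch with a random stand-in matrix, not the raefsky matrix itself):

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(9000, 9000, density=1e-3, format='csr', random_state=0)
    x = np.ones(A.shape[1])
    for r, c in [(1, 1), (2, 2), (3, 3)]:
        B = sp.bsr_matrix(A, blocksize=(r, c))   # fills explicit zeros per r x c block
        fill = B.data.size / A.nnz               # stored entries / true nonzeros
        print(f"{r}x{c}: fill ratio = {fill:.2f}")
        y = B @ x                                # SpMV on the blocked format

Whether the fill pays off depends on the machine: the extra flops are wasted, but denser blocks cut index overhead and memory references, which is why search (as in the register profiles above) is needed.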

                                                            85

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                            86

                                                            100x100 Submatrix Along Diagonal

87

                                                            Post-RCM Reordering

                                                            88

                                                            Effect of Combined RCM+TSP Reordering

Before: Green + Red
After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

                                                            Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                                            90

                                                            Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV, Ax & ATy, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                            91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                            93

Example: Classical Conjugate Gradient (CG)

[Figure: CG pseudocode; the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Figure: CA-CG pseudocode; the s SpMVs are computed via the CA Matrix Powers Kernel, and the dot products become one global reduction to compute the Gram matrix G. Local computations within the inner loop require no communication]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                            96

[Figure: convergence of CG vs CA-CG (monomial basis), with a line at machine precision. Slower convergence due to roundoff; loss of accuracy due to roundoff. At s = 16 the monomial basis is rank deficient and the method breaks down.]

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• Cond(A) ~ 400
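The breakdown is easy to reproduce (a sketch: it builds the model problem above and watches the monomial basis lose independence):

    import numpy as np
    import scipy.sparse as sp

    n = 30                                        # 2D Poisson, 5-point stencil, 30x30 grid
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()

    rng = np.random.default_rng(1)
    v = rng.standard_normal(n * n)
    for s in (4, 8, 16):
        V = np.empty((n * n, s + 1))
        V[:, 0] = v / np.linalg.norm(v)
        for j in range(s):                        # monomial basis: p, Ap, A^2·p, ...
            w = A @ V[:, j]
            V[:, j + 1] = w / np.linalg.norm(w)   # column scaling alone doesn't help
        print(s, np.linalg.cond(V))               # approaches 1/macheps near s = 16

The columns all converge toward the dominant eigenvector, so the basis becomes numerically rank deficient; better-conditioned polynomial bases are the standard fix in the CA-Krylov literature.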

                                                            97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·VT, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·VT)^k as sum of S^k and low-rank matrices

                            Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
Entries implicit (o(nnz))   Graph Laplacian              Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                            101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

                                                            Absolute Error for Random Vectors

                                                            Same magnitude opposite signs

                                                            Intel MKL non-reproducibility

                                                            Relative Error for Orthogonal vectors

                                                            Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

                                                            Sign notreproducible

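The root cause is easy to reproduce on any machine; a minimal sketch (my example, not the MKL experiment itself) showing that summing identical data in different orders gives different answers:

    import random

    random.seed(0)
    x = [random.uniform(-1, 1) * 10 ** random.randint(0, 15) for _ in range(10 ** 5)]

    def pairwise(v):
        # fixed binary reduction tree, like a parallel reduction would use
        if len(v) == 1:
            return v[0]
        mid = len(v) // 2
        return pairwise(v[:mid]) + pairwise(v[mid:])

    print(sum(x) - sum(reversed(x)))   # typically nonzero
    print(sum(x) - pairwise(x))        # also typically nonzero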

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (sacrifices goal 2 or 3)
  – Use (very) high precision to get exact answer (sacrifices goal 2)
  – Pre-rounding technique (Nguyen, D.) — sketched below

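A minimal single-bin sketch of the pre-rounding idea (the real Demmel–Nguyen algorithm uses several bins and handles overflow and user-chosen accuracy; this only shows why pre-rounded sums are order-independent):

    import math

    def reproducible_sum(x, bits=30):
        # assumes x is nonempty and not all zero
        m = max(abs(v) for v in x)
        # grid spacing: coarse enough that all partial sums stay exactly
        # representable in double precision (needs len(x) * 2**bits < 2**53)
        ulp = 2.0 ** (math.floor(math.log2(m)) - bits)
        # pre-round each summand to the grid; every addition below is then
        # exact, so any summation order gives the bit-identical result
        return sum(round(v / ulp) * ulp for v in x)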

Performance results on 1024-proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers)


Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: L3 cache misses (linear scale) for the shared-memory inner product, m = n = 64, k = 524,288; CARMA incurs 86%–97% fewer misses]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
        update column i
        update trailing matrix
  words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  words_moved = O(n^3 / M^(1/3))
• Recursive Approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea
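A runnable sketch of the recursive approach (my simplification: unpivoted and in-place in numpy, for illustration only — without pivoting it can divide by zero):

    import numpy as np

    def recursive_lu(A):
        # factor A in place into unit-lower L (below diagonal) and U (upper),
        # splitting columns in halves: the cache-oblivious pattern above
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]                 # update the single column
            return A
        m = n // 2
        recursive_lu(A[:, :m])                  # factor left half
        L11 = np.tril(A[:m, :m], -1) + np.eye(m)
        A[:m, m:] = np.linalg.solve(L11, A[:m, m:])   # U12 = L11^{-1} A12
        A[m:, m:] -= A[m:, :m] @ A[:m, m:]            # Schur complement
        recursive_lu(A[m:, m:])                 # factor trailing block
        return A

    A = np.random.rand(6, 6) + 6 * np.eye(6)    # diagonally dominant: safe
    LU = recursive_lu(A.copy())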

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on W = [W0; W1; W2; W3]]
• Parallel (binary tree): local QRs give R00, R10, R20, R30; combine pairs into R01, R11; combine those into R02
• Sequential/Streaming (flat tree): R00, R01, R02, R03, absorbing one block at a time
• Dual Core (hybrid tree): mixes the two patterns

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
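A minimal serial sketch of the binary-tree variant (my code; real TSQR also accumulates the implicit Q factors, omitted here):

    import numpy as np

    def tsqr_R(blocks):
        # leaves: local QR of each block row W_i
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]
        # tree: stack neighboring R factors and re-factor until one R remains
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(4000, 50)                # tall and skinny
    R = tsqr_R(np.vsplit(W, 4))
    # R agrees (up to signs of rows) with the R of a direct QR of W
    print(np.allclose(np.abs(R), np.abs(np.linalg.qr(W, mode='r'))))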

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
• Factor each block, Wi = Pi·Li·Ui, and choose b pivot rows of each Wi, calling them Wi'
• Combine pairs: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, calling them W12' and W34'
• Combine again: [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows
• Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
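A minimal serial sketch of tournament pivoting (my code, using SciPy's pivoted LU at each tree node; assumes b divides n and the number of leaf groups is a power of two):

    import numpy as np
    from scipy.linalg import lu

    def gepp_winners(rows, W, b):
        # GEPP on the candidate rows: W[rows] = P @ L @ U, so column i of P
        # marks the original row that became pivot row i; keep the first b
        P, L, U = lu(W[rows])
        return [rows[int(np.argmax(P[:, i]))] for i in range(b)]

    def tournament_pivots(W, b):
        n = W.shape[0]
        groups = [gepp_winners(list(range(i, i + b)), W, b)   # leaves
                  for i in range(0, n, b)]
        while len(groups) > 1:                                 # tree rounds
            groups = [gepp_winners(groups[i] + groups[i + 1], W, b)
                      for i in range(0, len(groups), 2)]
        return groups[0]    # b pivot rows to move to the top of W

    W = np.random.rand(32, 4)
    print(tournament_pivots(W, b=4))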

Minimizing Communication in TSLU

[Figure: the same reduction trees as TSQR, with a small LU at each node in place of QR — Parallel (binary tree), Sequential/Streaming (flat tree), Dual Core (hybrid)]

Can choose reduction tree dynamically, to match architecture, as before

Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: get same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A, and compare L factors: are they the same?
    • Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: ||L − Lnp|| often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
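The Matlab experiment translates directly; a minimal sketch in Python (my code, same idea: compare the pivoted L against the L from unpivoted elimination on the pre-permuted matrix):

    import numpy as np
    from scipy.linalg import lu

    def l_difference(A):
        P, L, U = lu(A)                        # pivoted: A = P @ L @ U
        M = (P.T @ A).copy()                   # pre-apply the permutation
        n = A.shape[0]
        Lnp = np.eye(n)                        # unpivoted Gaussian elimination
        for k in range(n - 1):
            Lnp[k+1:, k] = M[k+1:, k] / M[k, k]
            M[k+1:, :] -= np.outer(Lnp[k+1:, k], M[k, :])
        return np.max(np.abs(L - Lnp))

    np.random.seed(0)
    for _ in range(5):
        B = np.random.rand(6, 3)               # random 6x6 rank-3 matrix
        print(l_difference(B @ np.random.rand(3, 6)))
    # prints 0s, infs, NaNs, or O(1) values, depending on rounding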

Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare):
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = Q·R using TSQR, then
  – Factor Q = P·L·U using TSLU, then
  – A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting
[Figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[Figure]

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x]

2.5D vs 2D LU, With and Without Pivoting
[Figure]

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

• Recursive LU (columnwise layout only):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M)

• SMLU (morphing between layouts):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M^(3/2))
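A minimal sketch of the two layouts being morphed between (my numpy illustration; the real SMLU reshapes recursively and in place):

    import numpy as np

    def to_blocked(A, b):
        # reorder an n x n array so each b x b block is contiguous in memory
        n = A.shape[0]
        return A.reshape(n // b, b, n // b, b).transpose(0, 2, 1, 3).copy()

    def to_columnwise(Ablk):
        # inverse reshape, back to the ordinary row/column layout
        n = Ablk.shape[0] * Ablk.shape[2]
        return Ablk.transpose(0, 2, 1, 3).reshape(n, n).copy()

    A = np.arange(36.0).reshape(6, 6)
    assert np.array_equal(to_columnwise(to_blocked(A, 2)), A)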

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n, for i = 1:n, for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21
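A serial numpy sketch (mine) of the ⊗ operation and the DC-APSP recursion above:

    import numpy as np

    def minplus(C, A, B):
        # C(i,j) = min(C(i,j), min_k A(i,k) + B(k,j)) -- the semiring "multiply"
        return np.minimum(C, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
        dc_apsp(D11)                          # all updates act on views of D
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    # adjacency matrix: np.inf for "no edge", 0 on the diagonal
    A = np.array([[0, 3, np.inf, 7],
                  [8, 0, 2, np.inf],
                  [5, np.inf, 0, 1],
                  [2, np.inf, np.inf, 0]], float)
    print(dc_apsp(A.copy()))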

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω(min(d·n/P^(1/2), d^2·n/P))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: sweeps 1–6 of bulge chasing. Orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T annihilate d diagonals of a band of width b+1, c columns at a time; each annihilation creates a bulge of size d+c that is chased down the band.]
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Conventional vs CA-SBR

    Conventional: touch all data 4 times    Communication-Avoiding: touch all data once

[Animations comparing the two sweep patterns]

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q = [[A11, A12], [ε, A22]] will be block upper triangular
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(cells list who attains the word and message bounds, for two-level and hierarchical memory)

    BLAS-3:             [FLPR'99] [BDLST'13] [MKL etc.]  (both memory models)
    Cholesky:           [G'97] [AP'00] [LAPACK] [BDHS'09]
    Sym. Indefinite:    [BBDDDPSTY'13]
    LU:                 [G'97] [T'97] [GDX'11] [BDLST'13]
    QR:                 [EG'98] [FW'03] [DGHL'12] [BDLST'13]
    Rank Revealing QR:  [BDD'11] [DGGX'13]
    Sym. Eig & SVD:     [BDD'11] [BDK'13]
    Non-Sym. Eig:       [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

                        Words (BW)                            Messages (L)              Saving factor
    BLAS-3              [AGZ'94] [MT'99] [ScaLAPACK]          [C'69] [vGW'97] [SD'11]   L: n/P^(1/2)
    Cholesky            [ScaLAPACK] [T'99] [SD'11]            (same)                    L: n/P^(1/2)
    Sym. Indefinite     [BBDDDPSTY'13] [ScaLAPACK]            [BBDDDPSTY'13]            L: n/P^(1/2)
    LU                  [ScaLAPACK] [GDX'11] [T'99] [SD'11]   [GDX'11] [T'99] [SD'11]   L: n/P^(1/2)
    QR                  [ScaLAPACK] [DGHL'12] [T'99]          [DGHL'12] [T'99]          L: n/P^(1/2)
    Rank Revealing QR   [BDD'11] [DGGX'13]
    Sym. Eig & SVD      [BDD'11] [BDK'13] [ScaLAPACK]         [BDD'11] [BDK'13]         L: n/P^(1/2)
    Non-Sym. Eig        [BDD'11]                              [BDD'11]                  BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = O(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
   – Does k SpMVs with A and starting vector
   – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
   – Assume matrix "well-partitioned"
   – Serial implementation:
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data; optimal
   – Parallel implementation on p processors:
      • Conventional: O(k log p) messages (k SpMV calls, dot products)
      • New: O(log p) messages; optimal
• Lots of speedup possible (modeled and measured)
   – Price: some redundant computation (see the matrix powers sketch below)
   – Challenges: poor partitioning, preconditioning, numerical stability

                                                              75
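A minimal sketch (Python/numpy; the function name and setup are illustrative, not from the talk) of the matrix powers kernel idea behind the O(1)/O(log p) counts, under the simplifying assumption that A is tridiagonal, so a k-deep halo determines k local SpMVs:

   import numpy as np

   def local_matrix_powers(A, x, lo, hi, k):
       # Compute (A^j x)[lo:hi] for j = 1..k with ONE k-deep ghost-zone
       # fetch instead of k separate 1-deep halo exchanges.
       n = A.shape[0]
       glo, ghi = max(0, lo - k), min(n, hi + k)   # ghost zone of depth k
       xloc = x[glo:ghi].copy()
       Aloc = A[glo:ghi, glo:ghi]                  # local block plus halo
       out = []
       for _ in range(k):
           xloc = Aloc @ xloc                      # redundant flops near halo edge
           out.append(xloc[lo - glo : hi - glo].copy())
       return out

   # Check against k global SpMVs on a tridiagonal A:
   n, k = 64, 4
   A = np.diag(2.0*np.ones(n)) - np.diag(np.ones(n-1), 1) - np.diag(np.ones(n-1), -1)
   x = np.random.rand(n)
   lo, hi = 16, 32                                 # this "processor's" rows
   ref = x.copy()
   for yj in local_matrix_powers(A, x, lo, hi, k):
       ref = A @ ref
       assert np.allclose(yj, ref[lo:hi])

Each processor fetches its k-deep ghost zone once, then computes k products locally: a little redundant arithmetic in the halo buys k-fold fewer messages.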

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                              77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                              78

Speedups on Itanium 2: The Need for Search

[Figure: SpMV performance in Mflops for every register block size; the reference implementation vs. the best block size found by search, 4x2]

                                                              79

Register Profile: Itanium 2

[Figure: register-blocking profile; performance ranges from 190 Mflops to 1190 Mflops]

                                                              80

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four machines, with best fraction of peak: Power3: 17%, Power4: 16%, Itanium 2: 33%, Itanium 1: 8%. Performance ranges: Power3: 122 to 252 Mflops; Power4: 459 to 820 Mflops; Itanium 1: 107 to 247 Mflops; Itanium 2: 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                              82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                              83

3x3 blocks look natural, but…

• Example: 3x3 blocking
   – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                              84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
   – Logical grid of 3x3 cells
   – Fill in explicit zeros
   – Unroll 3x3 block multiplies
   – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
   – Actual Mflop rate is 1.5² = 2.25x higher
(See the blocking sketch below.)

                                                              85
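A hedged sketch (Python/scipy; not the code behind the slide's numbers) of the same register-blocking trade: converting CSR to BSR fills in explicit zeros so every stored block is dense, and the extra flops leave the result unchanged:

   import numpy as np
   from scipy.sparse import random as sprandom

   A = sprandom(900, 900, density=0.01, format="csr", random_state=0)
   B = A.tobsr(blocksize=(3, 3))   # explicit zero fill-in happens here
   fill_ratio = B.nnz / A.nnz      # stored values / true nonzeros
   x = np.ones(900)
   y = B @ x                       # dense 3x3 block multiplies inside
   assert np.allclose(y, A @ x)    # filled-in zeros don't change the result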

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                              86

                                                              100x100 Submatrix Along Diagonal

87

                                                              Post-RCM Reordering

                                                              88

                                                              Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
   – Register blocking (RB): up to 4x over CSR
   – Reordering to create dense structure: 2x over CSR
   – Variable block splitting: 2.1x over CSR, 1.8x over RB
   – Diagonals: 2x over CSR
   – Symmetry: 2.8x over CSR, 2.6x over RB
   – Cache blocking: 2.8x over CSR
   – Multiple vectors (SpMM): 7x over CSR
   – And combinations…
• Sparse triangular solve
   – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
   – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
   – More general kernels later…

                                                              90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
   – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
   – Does both off-line and run-time tuning
   – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
   – Available as stand-alone library
   – Available as PETSc extension
   – bebop.cs.berkeley.edu/oski
• pOSKI
   – Extension to multicore architectures
   – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
   – bebop.cs.berkeley.edu/poski

                                                              91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

                                                              93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure: the SpMVs and dot products require communication in each iteration; see the CG sketch below]

94

Example: CA-Conjugate Gradient

[Algorithm figure: the k SpMVs are done via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
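For reference, a textbook CG in Python/numpy (a sketch, not the slide's code) with the per-iteration communication points marked in comments; CA-CG restructures k such iterations into one matrix powers call plus one global reduction:

   import numpy as np

   def cg(A, b, tol=1e-8, maxit=1000):
       x = np.zeros_like(b)
       r = b.copy()                   # r = b - A x, with x = 0
       p = r.copy()
       rr = r @ r                     # global reduction (dot product)
       for _ in range(maxit):
           Ap = A @ p                 # SpMV: neighbor communication
           alpha = rr / (p @ Ap)      # global reduction
           x += alpha * p
           r -= alpha * Ap
           rr_new = r @ r             # global reduction
           if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
               break
           p = r + (rr_new / rr) * p
           rr = rr_new
       return x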

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

                                                              96

[Figure: convergence of CG vs. CA-CG with the monomial basis, residual norm plotted down to machine precision. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence due to roundoff and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. See the sketch below]

                                                              97
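The s = 16 breakdown is easy to reproduce; a sketch (Python/numpy) under the assumption that the slide's model problem and a normalized but unorthogonalized monomial basis are what matter:

   import numpy as np

   n = 30
   T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
   A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))   # 2D Poisson, 30x30 grid
   print(np.linalg.cond(A))                            # ~400, as on the slide

   x = np.random.rand(n*n)
   V = [x / np.linalg.norm(x)]
   for s in range(16):
       v = A @ V[-1]
       V.append(v / np.linalg.norm(v))   # monomial basis: no orthogonalization
   print(np.linalg.cond(np.column_stack(V)))   # grows toward ~1/macheps

The monomial vectors all converge toward the dominant eigenvector, so the basis condition number grows until, around s = 16 in double precision, the basis is numerically rank deficient; Newton or Chebyshev bases are the usual fix.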

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                         Indices explicit (O(nnz))   Indices implicit (o(nnz))
   Entries explicit:     CSR and variations          Vision, climate, AMR, …
   Entries implicit:     Graph Laplacian             Stencils

• Matrix could be a sum of "sparse" matrices
   – Ex: A = sparse + low rank = S + UDVᵀ, with D small & square
• Semiseparable matrices arise as preconditioners
   – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices (see the sketch below)
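A small sketch (Python/scipy; names illustrative) of why the S + UDVᵀ form is useful: A, and hence powers Aᵏx, can be applied without ever forming a dense matrix:

   import numpy as np
   from scipy.sparse import random as sprandom

   n, r = 1000, 5
   S = sprandom(n, n, density=0.01, format="csr", random_state=0)
   U, V = np.random.randn(n, r), np.random.randn(n, r)
   D = np.diag(np.random.randn(r))

   def apply_A(x):
       # O(nnz + n*r) work instead of O(n^2)
       return S @ x + U @ (D @ (V.T @ x))

   x = np.random.randn(n)
   y = apply_A(apply_A(x))   # A^2 x, with no dense matrix formed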

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

                                                              101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
   – From Kai Diethelm, at GNS-MBH
   – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
   – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
   – Most: "What? How will I debug without reproducibility?"
   – Few: "I know better, and do careful error analysis"
   – S. Govindjee: needs it for fracture simulations
   – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: dot products computed with Intel MKL on vectors of size 1e6, data aligned to 16-byte boundaries; for each input vector the dot product is computed using 1, 2, 3, or 4 threads. Absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. For random vectors the errors have the same magnitude but opposite signs; for orthogonal vectors even the sign of the result is not reproducible]

                                                              103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
   1. Same answer, independent of layout, #processors, order of summands
   2. Good performance (scales well)
   3. Portable (assume IEEE 754 only)
   4. User can choose accuracy
• Approaches:
   – Guarantee fixed reduction tree (gives up goal 2 or 3)
   – Use (very) high precision to get exact answer (gives up goal 2)
   – Prerounding technique (Nguyen, D.); see the sketch below

                                                              104
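A toy sketch of the pre-rounding idea (Python/numpy; a drastic simplification of the Nguyen/Demmel algorithm, which uses several "bins" to retain accuracy and folds the max into a single reduction): round every summand to one common bin so that each subsequent addition is exact, hence order-independent:

   import numpy as np

   def reproducible_sum(x):
       n = len(x)
       M = np.max(np.abs(x))              # reduction 1: max
       if M == 0.0:
           return 0.0
       # Bin chosen so all n rounded terms add with no rounding error.
       shift = 2.0 ** (np.ceil(np.log2(M)) + np.ceil(np.log2(n)) + 1)
       xr = (x + shift) - shift           # round each x[i] to the bin
       return float(np.sum(xr))           # reduction 2: exact in any order

   x = np.random.randn(10**6)
   s1 = reproducible_sum(x)
   s2 = reproducible_sum(np.random.permutation(x))
   assert s1 == s2                        # bitwise identical

Accuracy is what the single bin gives up (summands far below the bin width round away); the real algorithm's extra bins recover accuracy while keeping the answer independent of summand order.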

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                              Summary

Don't Communic…

                                                              106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
      for i = 1 to n
         update column i
         update trailing matrix
   words_moved = O(n³)
• Blocked approach (LAPACK):
      for i = 1 to n/b
         update block i of b columns
         update trailing matrix
   words_moved = O(n³/M^(1/3))
• Recursive approach:
      func factor(A)
         if A has 1 column, update it
         else
            factor(left half of A)
            update right half of A
            factor(right half of A)
   words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

35

TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree), W = [W0; W1; W2; W3]:
   local QRs: W0 -> R00, W1 -> R10, W2 -> R20, W3 -> R30
   combine pairs: [R00; R10] -> R01, [R20; R30] -> R11
   combine: [R01; R11] -> R02

Sequential/Streaming (flat tree), W = [W0; W1; W2; W3]:
   W0 -> R00; [R00; W1] -> R01; [R01; W2] -> R02; [R02; W3] -> R03

Dual Core, W = [W0; W1; W2; W3]:
   a hybrid of the two trees above: each core runs a flat tree over its half of the blocks, and the two cores' R factors are combined at the end

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core: same idea. (See the TSQR sketch below.)
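A hedged sketch of parallel TSQR on a binary reduction tree (Python/numpy, using the library QR as the local factorization); only the small R factors move between "processors", and Q stays implicit as the product of the local Qs:

   import numpy as np

   def tsqr_R(blocks):
       # R factor of the stacked tall-skinny matrix [b0; b1; ...]
       Rs = [np.linalg.qr(b)[1] for b in blocks]         # leaf QRs, in parallel
       while len(Rs) > 1:                                # combine pairwise
           Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
                 for i in range(0, len(Rs), 2)]
       return Rs[0]

   W = np.random.randn(4000, 8)
   R = tsqr_R(np.array_split(W, 4))        # W0..W3 on 4 "processors"
   # Same R as a direct QR of W, up to the sign of each row:
   assert np.allclose(np.abs(R), np.abs(np.linalg.qr(W)[1]))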

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
   Factor each block: Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi'
   Stack winners pairwise and factor again:
      [W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
      [W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'
   Final round: [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting). (See the sketch below.)

                                                                37
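A sketch of one tournament (Python/scipy; names illustrative): each round runs ordinary partially pivoted LU on a small stack of candidates and keeps the b winning rows, halving the candidates until b pivot rows remain for the whole panel:

   import numpy as np
   from scipy.linalg import lu

   def best_rows(block, b):
       p, l, u = lu(block)            # block = p @ l @ u (GEPP)
       return (p.T @ block)[:b]       # the b rows GEPP chose as pivots

   def tournament_pivot_rows(W, nblocks, b):
       cands = [best_rows(Wi, b) for Wi in np.array_split(W, nblocks)]
       while len(cands) > 1:          # binary reduction tree
           cands = [best_rows(np.vstack(cands[i:i+2]), b)
                    for i in range(0, len(cands), 2)]
       return cands[0]                # b pivot rows for the whole panel

   W = np.random.randn(1024, 4)
   pivot_rows = tournament_pivot_rows(W, nblocks=4, b=4)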

Minimizing Communication in TSLU

Parallel (binary tree), W = [W1; W2; W3; W4]: local LU on each block, then LU at each combining step up the tree
Sequential/Streaming (flat tree): LU of W1, then fold in W2, W3, W4 one at a time
Dual Core: a hybrid of the two trees

Can choose reduction tree dynamically, to match architecture, as before

                                                                38

Making TSLU Numerically Stable

• Details matter
   – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
   – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

                                                                39

Stability of LU using TSLU: CALU

40

• Empirical testing
   – Both random matrices and "special ones"
   – Both binary tree (BCALU) and flat-tree (FCALU)
   – 3 metrics: ||PA - LU|| / ||A||, normwise and componentwise backward errors
   – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct, in exact arithmetic
• Experiment
   – Generate 100 random 6x6, rank-3 matrices in Matlab
   – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
      • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs
      • Rest mostly O(1)
   – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
   – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
   – Same experiment with 20x20, rank-4 matrices: || L - Lnp || often O(10³)
• Much harder to break TSLU, but possible
   – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization (see the experiment sketch below)

41
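The slide's Matlab experiment, recreated as a hedged Python/numpy sketch (lu_nopivot is an illustrative helper, not a library routine):

   import numpy as np
   from scipy.linalg import lu
   np.seterr(divide="ignore", invalid="ignore")   # expect 1/0 on rank-deficient A

   def lu_nopivot(a):
       a = a.copy(); n = a.shape[0]
       for k in range(n - 1):
           a[k+1:, k] /= a[k, k]                  # may divide by (near) zero
           a[k+1:, k+1:] -= np.outer(a[k+1:, k], a[k, k+1:])
       return np.tril(a, -1) + np.eye(n)

   rng = np.random.default_rng(0)
   diffs = []
   for _ in range(100):
       a = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
       p, l, u = lu(a)                            # GEPP: a = p @ l @ u
       lnp = lu_nopivot(p.T @ a)                  # unpivoted LU of P·A
       diffs.append(np.linalg.norm(l - lnp))
   d = np.array(diffs)   # a few 0s, a few infs, a few NaNs; rest mostly O(1)
   print(np.sum(d == 0), np.sum(np.isinf(d)), np.sum(np.isnan(d)))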

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

                                                                42

                                                                2D CALU with Tournament Pivoting

                                                                43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                                                44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heatmap of predicted speedup over log2(p) (horizontal axis) and log2(n²/p) = log2(memory_per_proc) (vertical axis); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
   – Seek a factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, with D "simple"
      • Save 1/2 the flops, preserve inertia
   – Usual approach: Bunch-Kaufman
      • D block diagonal with 1x1 and 2x2 blocks
      • Pivot search down column, along row (lots of communication)
   – Alternative: Aasen
      • D = tridiagonal = T
      • Two steps: P·A·Pᵀ = L·T·Lᵀ, where T is banded, using TSLU; then solve/factor the narrow band problem with T
      • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

[Figure: the banded matrix T]

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
   – So far, could not do partial pivoting and minimize #messages, just #words
   – Challenge:
      • Column layout good for choosing pivots, bad for matmul
      • Blocked layout good for matmul, bad for choosing pivots
   – Solution: use both layouts, switching between them: "Shape Morphing LU", or SMLU

49

• Recursive LU (columnwise layout):
      func factor(A)
         if A has 1 column, update it
         else
            factor(left half of A)
            update right half of A
            factor(right half of A)
   Words = O(n³/M^(1/2)), Messages = O(n³/M)

• SMLU:
      func factor(A)
         if A has 1 column, update it
         else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
   Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))

(A runnable sketch of the recursion follows.)
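To make the recursion concrete, a runnable sketch of the recursive LU above (Python/numpy, no pivoting; SMLU's reshape steps are omitted since they change only the data layout, not the arithmetic):

   import numpy as np

   def rfactor(A):
       # In-place recursive LU: A is overwritten with L\U (L unit lower).
       n = A.shape[1]
       if n == 1:
           A[1:, 0] /= A[0, 0]                        # update the single column
           return
       m = n // 2
       rfactor(A[:, :m])                              # factor left half
       L11 = np.tril(A[:m, :m], -1) + np.eye(m)
       A[:m, m:] = np.linalg.solve(L11, A[:m, m:])    # update right half: U12
       A[m:, m:] -= A[m:, :m] @ A[:m, m:]             # Schur complement
       rfactor(A[m:, m:])                             # factor right half

   A = np.random.rand(8, 8) + 8*np.eye(8)   # diagonally dominant: safe unpivoted
   F = A.copy(); rfactor(F)
   L, U = np.tril(F, -1) + np.eye(8), np.triu(F)
   assert np.allclose(L @ U, A)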

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
   – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank-Revealing QR (RRQR)
   – Usual approach, like partial pivoting:
      • Put longest column first, update rest of matrix, repeat
      • Hard to do using BLAS3 at all, let alone hit the lower bound
   – Use Tournament Pivoting
      • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
      • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
   – Idea extends to other pivoting schemes
      • Cholesky with diagonal pivoting
      • LU with complete pivoting
      • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

      for k = 1:n
         for i = 1:n
            for j = 1:n
               D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
   – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

      D = DC-APSP(A, n):
         D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
         D11 = DC-APSP(D11, n/2)
         D12 = D11 * D12
         D21 = D21 * D11
         D22 = D21 * D12
         D22 = DC-APSP(D22, n/2)
         D21 = D22 * D21
         D12 = D12 * D22
         D11 = D12 * D21

(A runnable sketch follows.)

52
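A runnable sketch of the recursion (Python/numpy), with A*B written as a min-plus multiply accumulated into the destination block, matching the slide's abbreviation:

   import numpy as np

   def minplus(D, A, B):
       # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
       return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

   def dc_apsp(A):
       n = A.shape[0]
       if n == 1:
           return A
       m = n // 2
       D = A.copy()
       D[:m, :m] = dc_apsp(D[:m, :m])
       D[:m, m:] = minplus(D[:m, m:], D[:m, :m], D[:m, m:])
       D[m:, :m] = minplus(D[m:, :m], D[m:, :m], D[:m, :m])
       D[m:, m:] = minplus(D[m:, m:], D[m:, :m], D[:m, m:])
       D[m:, m:] = dc_apsp(D[m:, m:])
       D[m:, :m] = minplus(D[m:, :m], D[m:, m:], D[m:, :m])
       D[:m, m:] = minplus(D[:m, m:], D[:m, m:], D[m:, m:])
       D[:m, :m] = minplus(D[:m, :m], D[:m, m:], D[m:, :m])
       return D

   # Quick check against triple-loop Floyd-Warshall:
   n = 8
   A = np.random.rand(n, n) * 10; np.fill_diagonal(A, 0.0)
   D = A.copy()
   for k in range(n):
       D = np.minimum(D, D[:, [k]] + D[[k], :])
   assert np.allclose(dc_apsp(A.copy()), D)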

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores); annotations: 6.2x speedup and 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
   – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
   – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
   – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

                                                                54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
   – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
   – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
   – LU & QR (tournament pivoting)
   – Sparse matrices
   – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
   – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
   – Reorganizing Krylov methods: Conjugate Gradients
   – Stability challenges and approaches
   – What is a "sparse matrix"?
• Floating-point reproducibility
   – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
   – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
   – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
   – (Q·U)'s columns are the eigenvectors, Λ holds the eigenvalues
   – Dense → Tridiagonal → Diagonal
   – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
   – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
   – Continue as above, starting with B
   – Dense → Banded → Tridiagonal → Diagonal
   – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
   – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Animation frames: sweeps of orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T chase the bulge created in steps 1–6 down and off the band. Legend: b = bandwidth, c = #columns in each block, d = #diagonals eliminated per sweep; constraint: c + d ≤ b]
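To make the Dense → Banded stage concrete, here is a minimal numpy/scipy sketch (my own illustration, using plain LAPACK QR where the talk's algorithm would use TSQR): each panel below the target band is QR-factored and Q is applied from both sides, leaving a symmetric banded matrix with the same eigenvalues.

import numpy as np
from scipy.linalg import qr

# Sketch of the Dense -> Banded step: QR each panel below the target
# band, then apply Q two-sidedly. Semibandwidth b results; the
# Banded -> Tridiagonal stage is the bulge-chasing pictured above.
def dense_to_banded(A, b):
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n - b - 1, b):
        rows = slice(j + b, n)            # panel below the band
        Q, R = qr(A[rows, j:j+b])         # zeros panel below b-th subdiagonal
        A[rows, :] = Q.T @ A[rows, :]     # apply Q^T from the left
        A[:, rows] = A[:, rows] @ Q       # and Q from the right (keeps symmetry)
    return A

n, b = 12, 3
M = np.random.rand(n, n)
A = M + M.T                               # symmetric test matrix
B = dense_to_banded(A, b)
assert np.allclose(np.triu(B, b + 1), 0, atol=1e-10)              # banded
assert np.allclose(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B))  # same spectrum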

Conventional vs CA - SBR

Conventional                Communication-Avoiding
Touch all data 4 times      Touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
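A hedged sketch of one splitting step (my own illustration: a Newton iteration for the matrix sign function builds the spectral projector, and a deterministic pivoted QR stands in for the randomized rank-revealing QR named above):

import numpy as np
from scipy.linalg import qr

# Split the spectrum of A about the imaginary axis: sign(A) via Newton
# iteration, projector P = (I + sign(A))/2, then Q from pivoted QR of P.
rng = np.random.default_rng(0)
n = 6
V = rng.random((n, n)) + np.eye(n)          # well-conditioned eigenvectors
A = V @ np.diag([-3., -2., -1., 1., 2., 3.]) @ np.linalg.inv(V)

X = A.copy()
for _ in range(20):                          # X <- (X + X^-1)/2 -> sign(A)
    X = (X + np.linalg.inv(X)) / 2
P = (np.eye(n) + X) / 2                      # projector onto right-half-plane subspace
k = int(round(np.trace(P)))                  # subspace dimension (= 3 here)

Q, R, piv = qr(P, pivoting=True)             # leading k columns span range(P)
T = Q.T @ A @ Q
assert np.linalg.norm(T[k:, :k]) < 1e-6      # block upper triangular
# Recurse on T[:k, :k] and T[k:, k:] to finish the eigendecomposition.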

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (#Words, #Messages) | Memory Hierarchy (#Words, #Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor: L by n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor: L by n/P^(1/2)
• Sym Indefinite: Words [BBDDDPSTY'13][ScaLAPACK]; Messages [BBDDDPSTY'13]; saving factor: L by n/P^(1/2)
• LU: Words [ScaLAPACK][GDX'11][T'99][SD'11]; Messages [GDX'11][T'99][SD'11]; saving factor: L by n/P^(1/2)
• QR: Words [ScaLAPACK][DGHL'12][T'99]; Messages [DGHL'12][T'99]; saving factor: L by n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: Words [BDD'11][BDK'13][ScaLAPACK]; Messages [BDD'11][BDK'13]; saving factor: L by n/P^(1/2)
• Non-Sym Eig: Words [BDD'11]; Messages [BDD'11]; saving factor: BW by P^(1/2), L by n

Attaining with extra memory: 2.5D, M = O(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured); see the sketch after this slide
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                75
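The key enabler is the "matrix powers kernel". Here is a minimal sketch for a 1D stencil matrix (my own illustration): with k ghost values fetched per side up front, a processor computes k SpMV steps with no further communication, redundantly recomputing inside the ghost region.

import numpy as np

# Matrix powers kernel idea for A = tridiag(-1, 2, -1): one up-front
# ghost exchange replaces k rounds of neighbor communication.
def local_matrix_powers(chunk, k):
    v = chunk.copy()
    for _ in range(k):                  # one stencil sweep per step;
        v = 2*v[1:-1] - v[:-2] - v[2:]  # valid region shrinks by 1 per side
    return v                            # = (A^k x) on the owned entries

n, k = 32, 3
x = np.random.rand(n)
lo, hi = 8, 16                          # this processor owns x[8:16]
owned = local_matrix_powers(x[lo-k:hi+k], k)   # one ghost exchange

A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
ref = np.linalg.matrix_power(A, k) @ x
assert np.allclose(owned, ref[lo:hi])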

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Register-blocking profile: reference implementation at 190 Mflops; best block size (4x2) at 1190 Mflops]

79

Register Profile: Itanium 2

[Heatmap of Mflops over all register block sizes, ranging from 190 Mflops to 1190 Mflops]

80

Register Profiles: IBM and Intel IA-64

[Heatmaps of Mflops over register block sizes; best / worst per machine:
• Power3 (1.7x range): 252 / 122 Mflops
• Power4 (1.6x): 820 / 459 Mflops
• Itanium 1 (1.8x): 247 / 107 Mflops
• Itanium 2 (3.3x): 1.2 Gflops / 190 Mflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher

                                                                85
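The fill-ratio trade-off is easy to reproduce with scipy's Block-CSR format (an illustrative sketch, not OSKI itself): converting CSR to 3x3 blocks stores explicit zeros to complete each block, so the same SpMV answer comes from more flops but a more regular, index-free inner loop.

import numpy as np
from scipy.sparse import random as sprandom

# Register blocking via scipy BSR: explicit zeros pad each occupied
# 3x3 block. On an unstructured random matrix the fill ratio is large
# (blocking hurts); on matrices with natural dense blocks, like
# raefsky's 8x8 substructure, it is close to 1 (blocking wins).
A_csr = sprandom(900, 900, density=0.01, format='csr', random_state=0)
A_bsr = A_csr.tobsr(blocksize=(3, 3))

fill_ratio = A_bsr.data.size / A_csr.nnz   # stored entries / true nonzeros
print(f"fill ratio: {fill_ratio:.2f}")

x = np.random.rand(900)
assert np.allclose(A_bsr @ x, A_csr @ x)   # same result, different layout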

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue

89

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                90

                                                                Optimized Sparse Kernel Interface - OSKI

                                                                bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

                                                                bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

                                                                bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

                                                                software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

                                                                91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing; annotation: SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing; annotations: the k SpMVs are computed via the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication]
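For reference, here is classical CG with its per-iteration communication points marked (a numpy sketch I added, not the talk's code; in CA-CG these costs are paid once per k iterations instead of every iteration):

import numpy as np

# Classical CG: in parallel, each A @ p is an SpMV (neighbor
# communication) and each dot product is a global reduction.
def cg(A, b, tol=1e-10, maxiter=500):
    x = np.zeros_like(b)
    r = b.copy()                      # residual
    p = r.copy()                      # search direction
    rs = r @ r                        # global reduction
    for _ in range(maxiter):
        Ap = A @ p                    # SpMV: neighbor communication
        alpha = rs / (p @ Ap)         # global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                # global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 2D Poisson test problem (5-point stencil on an m x m grid)
m = 30
T = 2*np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
b = np.random.rand(m*m)
x = cg(A, b)
assert np.allclose(A @ x, b, atol=1e-6)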

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                96

[Convergence plot, CG vs CA-CG (monomial basis), with the machine-precision line shown. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG converges more slowly due to roundoff and loses accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down]

                                                                97
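The breakdown is easy to demonstrate (an illustrative sketch I added, on the same model problem): the columns of the monomial basis [x, Ax, …, A^s·x] align with the dominant eigenvector, so its conditioning blows up as s grows.

import numpy as np

# 30x30-grid 2D Poisson matrix; watch the monomial Krylov basis become
# numerically rank deficient as s grows. Newton or Chebyshev polynomial
# bases are the standard fix.
m = 30
T = 2*np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))   # 5-point stencil
x = np.random.rand(m*m)

for s in (4, 8, 16):
    V = np.empty((m*m, s + 1))
    V[:, 0] = x / np.linalg.norm(x)
    for j in range(s):                  # normalized monomial basis
        w = A @ V[:, j]
        V[:, j+1] = w / np.linalg.norm(w)
    print(f"s={s:2d}  rank={np.linalg.matrix_rank(V)}"
          f"  cond={np.linalg.cond(V):.1e}")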

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as sum of S^k and low-rank matrices (see the sketch below)

Taxonomy (rows: nonzero entries; columns: indices):

                           Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit (O(nnz))  CSR and variations          Vision, climate, AMR, …
Entries implicit (o(nnz))  Graph Laplacian             Stencils
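A small sketch of why the S + U·D·V^T representation pays off (my own illustration): A^k·x can be applied without ever forming A, touching only the sparse S and the thin low-rank factors.

import numpy as np
from scipy.sparse import random as sprandom

# Apply A^k x for A = S + U D V^T without forming A densely:
# each apply costs O(nnz(S) + n*r) instead of O(n^2).
n, r, k = 500, 5, 4
S = sprandom(n, n, density=0.01, format='csr', random_state=0)
U, V = np.random.rand(n, r), np.random.rand(n, r)
D = np.diag(np.random.rand(r))

def apply_A(x):
    return S @ x + U @ (D @ (V.T @ x))

x = np.random.rand(n)
y = x
for _ in range(k):
    y = apply_A(y)

A = S.toarray() + U @ D @ V.T          # dense check (small n only)
assert np.allclose(y, np.linalg.matrix_power(A, k) @ x)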

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                101

Reproducible Floating Point Computation

• Goal: get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for dot products of random vectors (results of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible)]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

                                                                103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (loses 2. or 3.)
  – Use (very) high precision to get exact answer (loses 2.)
  – Prerounding technique (Nguyen, D.)

104

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
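A minimal sketch of the prerounding idea (my simplification of the Nguyen-Demmel method to a single "bin"; the real algorithm uses several bins to retain accuracy):

import numpy as np

# Pre-round every summand to a common power-of-two boundary chosen from
# the global max; every subsequent addition is then exact, so the sum
# is bit-wise identical for any summation order.
def reproducible_sum(x):
    M = np.max(np.abs(x))
    if M == 0.0:
        return 0.0
    boundary = 2.0 ** np.ceil(np.log2(x.size * M))   # >= n * max|x_i|
    rounded = (x + boundary) - boundary   # exact rounding trick (round-to-nearest)
    return float(np.sum(rounded))         # order no longer matters

x = np.random.randn(10**6)
s1 = reproducible_sum(x)
s2 = reproducible_sum(x[::-1].copy())     # different summation order
assert s1 == s2                           # bit-wise identical
print(s1, "vs ordinary sum", x.sum())     # accuracy limited by the one bin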

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


One-sided Factorizations (LU, QR), so far

35

• Classical Approach
    for i = 1 to n
      update column i
      update trailing matrix
  – #words_moved = O(n^3)
• Blocked Approach (LAPACK)
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – #words_moved = O(n^3/M^(1/3))
• Recursive Approach
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
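For concreteness, a sketch of the recursive approach for LU (my own illustration; pivoting is omitted, which is safe here only because the test matrix is diagonally dominant; tournament pivoting, next, supplies the missing pivoting):

import numpy as np
from scipy.linalg import solve_triangular

# Recursive LU on column halves: the work migrates into large GEMMs,
# which is why this variant moves only O(n^3 / M^(1/2)) words.
def recursive_lu(A):                      # in-place, no pivoting
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]               # scale to form the L column
        return
    k = n // 2
    recursive_lu(A[:, :k])                # factor left half
    A[:k, k:] = solve_triangular(A[:k, :k], A[:k, k:],
                                 lower=True, unit_diagonal=True)
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]    # Schur complement: big GEMM
    recursive_lu(A[k:, k:])               # factor right half

n = 64
A0 = np.random.rand(n, n) + n * np.eye(n)  # diagonally dominant: no pivoting needed
A = A0.copy()
recursive_lu(A)
L = np.tril(A, -1) + np.eye(n)
U = np.triu(A)
assert np.allclose(L @ U, A0)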

TSQR: An Architecture-Dependent Algorithm

[Reduction-tree diagrams for W = [W0; W1; W2; W3]:
• Parallel (binary tree): local QRs give R00, R10, R20, R30; pairs combine to R01, R11; final combine gives R02
• Sequential/Streaming (flat tree): W0 → R00; (R00, W1) → R01; (R01, W2) → R02; (R02, W3) → R03
• Dual Core (hybrid tree): mixes the two patterns]

Can choose reduction tree dynamically: Multicore / Multisocket / Multirack / Multisite / Out-of-core
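A compact sketch of the parallel (binary-tree) variant (my own illustration; only R is formed, and Q stays implicit):

import numpy as np

# TSQR: local QR on each row block, then QR of stacked R factors,
# pairwise up the tree. One small (b x b) message per pair per level.
def tsqr_R(blocks):
    Rs = [np.linalg.qr(W)[1] for W in blocks]        # leaf QRs, no comm.
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(pair))[1]
              for pair in zip(Rs[0::2], Rs[1::2])]   # combine pairs
    return Rs[0]

W = np.random.rand(4000, 8)                          # tall and skinny
R_tree = tsqr_R(np.vsplit(W, 4))
R_ref = np.linalg.qr(W)[1]
# R is unique only up to row signs; compare via R^T R = W^T W
assert np.allclose(R_tree.T @ R_tree, R_ref.T @ R_ref)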

                                                                  Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

W (n-by-b) = [W1; W2; W3; W4] (block rows)

1. Factor each block: Wi = Pi·Li·Ui. Choose b pivot rows of each Wi; call them Wi'.
2. Stack the winners pairwise: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34. Choose b pivot rows of each; call them W12' and W34'.
3. Factor [W12'; W34'] = P1234·L1234·U1234. Choose the final b pivot rows.
4. Go back to W and use these b pivot rows (move them to the top, do LU without pivoting).
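The selection step has the same tree structure as TSQR, with GEPP as the local factorization at each node. Below is a minimal serial sketch in Python/NumPy; the name tournament_pivot_rows and the fixed leaf layout are illustrative assumptions, and a real TSLU runs each round of the tree in parallel.

    import numpy as np
    import scipy.linalg as sla

    def tournament_pivot_rows(W, b):
        # Select b pivot rows of a tall-skinny W (n-by-b) via a binary
        # reduction tree of GEPP factorizations (tournament pivoting).
        n = W.shape[0]
        # Leaves of the tree: contiguous blocks of rows (global indices).
        cand = [np.arange(i, min(i + 2 * b, n)) for i in range(0, n, 2 * b)]
        while True:
            winners = []
            for idx in cand:
                P, _, _ = sla.lu(W[idx])        # GEPP on this block's rows
                order = np.argmax(P, axis=0)    # rows of W[idx] in pivot order
                winners.append(idx[order[:b]])  # the b "best" rows survive
            if len(winners) == 1:
                return winners[0]               # the final b pivot rows of W
            # Merge surviving rows pairwise and play the next round.
            cand = [np.concatenate(winners[i:i + 2])
                    for i in range(0, len(winners), 2)]

The returned indices are then permuted to the top of W, and LU without pivoting finishes the panel, as in step 4 above.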


Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with an LU factorization at each node, applied to W = [W1; W2; W3; W4]: parallel binary tree, sequential/streaming flat tree, and a dual-core hybrid.]

Can choose reduction tree dynamically to match architecture, as before.

Making TSLU Numerically Stable

• Details matter:
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it produces the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing:
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA - LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct, in exact arithmetic
• Experiment (a Python version is sketched below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A; compare the L factors: are they the same?
    • Compute ||L - Lnp||: a few 0's, a few inf's, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L - Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: ||L - Lnp|| often O(10^3)
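A Python/NumPy version of the experiment, under the stated setup. SciPy's lu returns A = P·L·U, so LU without pivoting is applied to P^T·A; lu_nopivot is a helper written out here for the sketch.

    import numpy as np
    import scipy.linalg as sla

    def lu_nopivot(A):
        # Gaussian elimination with no pivoting; returns the unit lower factor.
        U = A.astype(float).copy()
        n = U.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            L[k + 1:, k] = U[k + 1:, k] / U[k, k]   # may divide by (near-)zero
            U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
        return L

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, U = sla.lu(A)               # GEPP
        Lnp = lu_nopivot(P.T @ A)         # same pivot order, different rounding
        diffs.append(np.max(np.abs(L - Lnp)))
    # Expect a few 0's, a few inf's, a few NaNs, and the rest mostly O(1).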

• Much harder to break TSLU, but possible:
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare); control flow is sketched below:
  – Test conditioning of U: if acceptable (usual case) proceed, else
  – Compute ||L||: if not big (usual case) proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = P·L·U using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
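A sketch of that control flow, assuming hypothetical tslu and tsqr routines (with Matlab-style return values); the thresholds are placeholders, not values from the lecture.

    import numpy as np

    def safeguarded_tslu(A, tslu, tsqr, cond_max=1e8, growth_max=1e4):
        # Fast path: accept TSLU unless U looks ill-conditioned or L grew.
        P, L, U = tslu(A)
        if np.linalg.cond(U) < cond_max and np.abs(L).max() < growth_max:
            return P, L, U
        # Rare fallback: A = QR (TSQR), Q = P*L*U (TSLU), so A = P*L*(U@R).
        Q, R = tsqr(A)
        P, L, U = tslu(Q)
        return P, L, U @ R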

• Last topic in lecture: how to guarantee floating point reproducibility


2D CALU with Tournament Pivoting
[Figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[Figure]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 = 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: predicted speedup as a function of log2(P) and log2(n^2/P) = log2(memory_per_proc); up to 29x.]

2.5D vs 2D LU, With and Without Pivoting [plot]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves 1/2 the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      1. P·A·P^T = L·T·L^T, where T is banded, using TSLU
      2. Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

• Recursive GEPP (columnwise layout throughout):

    func factor(A):
      if A has 1 column:
        update it
      else:
        factor(left half of A)
        update right half of A
        factor(right half of A)

  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M)

• SMLU (switch layouts at each level):

    func factor(A):
      if A has 1 column:
        update it
      else:
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach is like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
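A serial sketch of the semiring and the recursion above, assuming NumPy, n a power of two, and D holding edge weights with a zero diagonal; star is an illustrative name for the in-place min-plus product D = A*B.

    import numpy as np

    def star(D, A, B):
        # D = A*B in the min-plus semiring:
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(D):
        # Kleene's divide-and-conquer APSP, following the recursion above.
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12 = dc_apsp(D[:m, :m]), D[:m, m:]
        D21, D22 = D[m:, :m], D[m:, m:]
        D12 = star(D12, D11, D12)
        D21 = star(D21, D21, D11)
        D22 = star(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = star(D21, D22, D21)
        D12 = star(D12, D12, D22)
        D11 = star(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

The loop structure is that of recursive matmul, which is why the 2.5D replication strategy carries over unchanged.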

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups: 6.2x and 2x.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω(d^2·n / (P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach:
  – A -> Q·A·Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth
c = #columns
d = #diagonals
Constraint: c + d ≤ b

[Figure sequence: a symmetric band matrix of bandwidth b+1 is reduced sweep by sweep. Each sweep applies an orthogonal transform Q_i from the left and Q_i^T from the right to annihilate a block of c columns (d diagonals at a time), creating a (d+c)-sized bulge that is then chased down the band; the panels show stages 1 through 6 using transforms Q1 through Q5.]

Conventional vs CA-SBR

  Conventional:             touch all data 4 times
  Communication-Avoiding:   touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words and #Messages, for two levels of memory and for a full memory hierarchy.

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.], words and messages, at both levels
• Cholesky: words [G'97][AP'00][LAPACK][BDHS'09]; messages [G'97][AP'00][BDHS'09], at both levels
• Sym. Indefinite: [BBDDDPSTY'13], words and messages
• LU: two levels: words [G'97][T'97][GDX'11][BDLST'13], messages [GDX'11][BDLST'13]; hierarchy: words [G'97][T'97][BDLST'13], messages [BDLST'13]
• QR: two levels: words [EG'98][FW'03][DGHL'12][BDLST'13], messages [FW'03][DGHL'12][BDLST'13]; hierarchy: words [EG'98][FW'03][BDLST'13], messages [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: words [BDD'11][BDK'13]; messages [BDD'11]
• Non-Sym. Eig: [BDD'11], words and messages

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor: L by n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor: L by n/P^(1/2)
• Sym. Indefinite: words [BBDDDPSTY'13][ScaLAPACK]; messages [BBDDDPSTY'13]; saving factor: L by n/P^(1/2)
• LU: words [ScaLAPACK][GDX'11][T'99][SD'11]; messages [GDX'11][T'99][SD'11]; saving factor: L by n/P^(1/2)
• QR: words [ScaLAPACK][DGHL'12][T'99]; messages [DGHL'12][T'99]; saving factor: L by n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: words [BDD'11][BDK'13][ScaLAPACK]; messages [BDD'11][BDK'13]; saving factor: L by n/P^(1/2)
• Non-Sym. Eig: words [BDD'11]; messages [BDD'11]; saving factor: BW by P^(1/2), L by n

Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector (a naive version of this kernel is sketched below)
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data (optimal)
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages (optimal)
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
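For reference, here is what the kernel must produce, written naively in NumPy notation (k passes over A and k rounds of messages); the CA matrix powers kernel returns the same basis with O(1) passes over A, or one ghost-zone exchange in parallel, at the price of some redundant computation.

    def matrix_powers(A, x, k):
        # Naive kernel: k SpMVs, i.e., k passes over A.
        # Returns the Krylov basis [x, A@x, ..., A^k @ x].
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])
        return V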


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Spy plot of the sparsity pattern]

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Register-profile plot, in Mflops: the reference implementation vs the best block size (4x2); performance varies enough across block sizes that search is required.]

Register Profile: Itanium 2

[Heat map over register block sizes: from 190 Mflops (worst) to 1190 Mflops (best).]

Register Profiles: IBM and Intel IA-64

[Four register-profile heat maps, best (with % of machine peak) vs reference:
  Power3:    252 Mflops best (17%), 122 Mflops reference
  Power4:    820 Mflops best (16%), 459 Mflops reference
  Itanium 1: 247 Mflops best (8%),  107 Mflops reference
  Itanium 2: 1.2 Gflops best (33%), 190 Mflops reference]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

[Spy plots: full matrix, and zoom in to top corner]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (measured as sketched below)
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher
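A quick way to measure the fill ratio of a candidate r x c blocking, assuming SciPy (whose BSR format pads partial blocks with explicit zeros, and whose dimensions must be divisible by the block size). An autotuner in the spirit of OSKI would benchmark each candidate (r, c) and keep the fastest, not just the least fill.

    import scipy.sparse as sp

    def fill_ratio(A, r, c):
        # Stored entries in r-by-c blocked format (incl. explicit zeros)
        # divided by true nonzeros; 1.5 means 50% extra work per SpMV.
        A = sp.csr_matrix(A)
        bsr = sp.bsr_matrix(A, blocksize=(r, c))
        return bsr.data.size / A.nnz

    # e.g. fill_ratio(A, 3, 3) == 1.5 for the example above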


Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Spy plot]

100x100 Submatrix Along Diagonal
[Spy plot]

Post-RCM Reordering
[Spy plot]

Effect of Combined RCM+TSP Reordering
[Spy plot: before = green + red; after = green + blue]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Example: Classical Conjugate Gradient (CG)

[Algorithm display: standard CG iteration.] SpMVs and dot products require communication in each iteration.
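A textbook CG loop with the communication points marked, written as a sketch in NumPy notation (in a parallel run, A and the vectors would be distributed):

    import numpy as np

    def cg(A, b, x, tol=1e-8, maxiter=1000):
        r = b - A @ x              # SpMV: nearest-neighbor communication
        p = r.copy()
        rho = r @ r                # dot product: global reduction
        for _ in range(maxiter):
            w = A @ p              # SpMV, every iteration
            alpha = rho / (p @ w)  # dot product, every iteration
            x += alpha * p
            r -= alpha * w
            rho_new = r @ r        # dot product, every iteration
            if np.sqrt(rho_new) <= tol:
                break
            p = r + (rho_new / rho) * p
            rho = rho_new
        return x

CA-CG (next slide) fuses k of these iterations: the k SpMVs become one matrix powers call, and the per-iteration dot products become one Gram-matrix reduction.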

Example: CA-Conjugate Gradient

[Algorithm display: s-step CG. The SpMVs are done via the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


[Convergence plot: CG vs CA-CG (monomial basis). Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence due to roundoff and loss of accuracy (relative to machine precision); at s = 16 the monomial basis is rank deficient and the method breaks down.]
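The breakdown is easy to reproduce: the condition number of the monomial basis [v, A·v, …, A^s·v] blows up well before s = 16 for this model problem. A sketch assuming SciPy (the helper name monomial_basis_cond is illustrative):

    import numpy as np
    import scipy.sparse as sp

    def monomial_basis_cond(A, v, s):
        # Condition number of the (column-normalized) monomial Krylov basis.
        V = [v / np.linalg.norm(v)]
        for _ in range(s):
            w = A @ V[-1]
            V.append(w / np.linalg.norm(w))
        return np.linalg.cond(np.column_stack(V))

    n = 30                               # 2D Poisson, 5-point stencil, 30x30 grid
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()
    v = np.random.default_rng(1).standard_normal(n * n)
    print(monomial_basis_cond(A, v, 16))   # enormous: numerically rank deficient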


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz))     CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz))     Graph Laplacian             Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
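For example, the "both implicit" cell of the table: a 5-point stencil needs no stored entries or indices at all. A sketch using SciPy's LinearOperator (the name laplace_5pt is illustrative):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator

    def laplace_5pt(n):
        # Matrix-free 5-point Laplacian on an n-by-n grid: the nonzero
        # values and their indices are implicit in the code, o(nnz) data.
        def matvec(x):
            u = x.reshape(n, n)
            y = 4.0 * u
            y[1:, :] -= u[:-1, :]
            y[:-1, :] -= u[1:, :]
            y[:, 1:] -= u[:, :-1]
            y[:, :-1] -= u[:, 1:]
            return y.ravel()
        return LinearOperator((n * n, n * n), matvec=matvec, dtype=float)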

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                  101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two panels. Left: "Absolute Error for Random Vectors" (errors of the same magnitude, opposite signs). Right: "Relative Error for Orthogonal Vectors" (even the sign is not reproducible).]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

103
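The root cause is easy to demonstrate without MKL: floating-point addition is not associative, so summing the same data in different orders (as different thread counts effectively do) gives different results. A small self-contained sketch:

    import random

    random.seed(0)
    x = [random.uniform(-1, 1) for _ in range(10**6)]

    s1 = sum(x)              # one summation order
    s2 = sum(reversed(x))    # another order, as a different thread count might use
    s3 = sum(sorted(x))      # yet another
    print(s1 - s2, s1 - s3)  # typically nonzero: different rounding errors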

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (satisfies 1 but not 2 or 3; sketched below)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
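A minimal sketch of the first approach, a fixed binary reduction tree: the summation order is pinned down by the tree rather than by the data layout or thread count, so the result is bit-wise reproducible as long as every participant uses the same tree (which is exactly what makes goals 2 and 3 hard):

    def fixed_tree_sum(x):
        # Sum with a data-independent binary reduction tree.
        xs = [float(v) for v in x]
        n = 1
        while n < len(xs):
            n *= 2
        xs += [0.0] * (n - len(xs))       # pad to a power of two
        while len(xs) > 1:                # pairwise reduce in a fixed order
            xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
        return xs[0]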

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers)


TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees for factoring W = [W0; W1; W2; W3].
Parallel (binary tree): local QRs give R00, R10, R20, R30; pairwise combines give R01, R11; a final combine gives R02.
Sequential/Streaming (flat tree): R00 is combined with W1, W2, W3 in turn, giving R01, R02, R03.
Dual Core: a hybrid of the two patterns.]

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
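A minimal sketch of the parallel (binary-tree) variant, computing only the R factor: local QRs at the leaves, then pairwise QRs of stacked R factors up the tree. Block count and sizes are illustrative; a full TSQR would also store the local Q factors at each node:

    import numpy as np

    def tsqr_r(W, nblocks=4):
        # Local QR on each row block (the leaves of the tree)
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
        while len(Rs) > 1:
            # Pairwise QR of stacked R's: one level up the binary tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.default_rng(0).standard_normal((4096, 8))
    R = tsqr_r(W)
    # R agrees with np.linalg.qr(W, mode='r') up to the signs of its rows.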

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  – Choose b pivot rows of W1, call them W1'
  – Choose b pivot rows of W2, call them W2'
  – Choose b pivot rows of W3, call them W3'
  – Choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12: choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34: choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234: choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting). A runnable sketch follows.

37
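A minimal sketch of tournament pivoting, assuming GEPP (via scipy.linalg.lu) as the local pivot-selection rule at every tree node; the block count and matrix sizes are illustrative:

    import numpy as np
    from scipy.linalg import lu

    def gepp_pivots(W, b):
        # Indices of the first b GEPP pivot rows of block W
        P, _, _ = lu(W)                  # W = P @ L @ U, P a permutation matrix
        return P.argmax(axis=0)[:b]      # row chosen as the i-th pivot

    def tournament_pivoting(W, b, nblocks=4):
        # Choose b pivot rows of tall-skinny W (n x b) with a binary
        # reduction tree of local GEPP factorizations.
        groups = np.array_split(np.arange(W.shape[0]), nblocks)
        cands = [g[gepp_pivots(W[g], b)] for g in groups]     # Wi' at the leaves
        while len(cands) > 1:                                  # playoff rounds
            merged = [np.concatenate(cands[i:i + 2])
                      for i in range(0, len(cands), 2)]
            cands = [g[gepp_pivots(W[g], b)] for g in merged]
        return cands[0]  # move these b rows to the top, then LU without pivoting

    rng = np.random.default_rng(0)
    pivot_rows = tournament_pivoting(rng.standard_normal((64, 4)), b=4)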

Minimizing Communication in TSLU

[Figure: the same three reduction trees as for TSQR, with a local LU at every node: parallel (binary tree), sequential/streaming (flat tree), and dual-core (hybrid).]

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

40

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different orders gives different rounding errors
  – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L - Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
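Plugging these parameters into the running-time model time = #flops · time_per_flop + #words / bandwidth + #messages · latency shows why latency dominates at this scale. A back-of-envelope sketch; the per-flop time gamma is our own assumption, not on the slide:

    # Rough per-node cost model with the exascale parameters above.
    alpha = 1e-6           # interconnect latency: 1 microsec
    beta = 8 / 100e9       # seconds per 8-byte word at 100 GB/sec
    gamma = 1e-12          # assumed 1 Tflop/s per node (not on the slide)

    def node_time(flops, words, messages):
        return flops * gamma + words * beta + messages * alpha

    # With these numbers one word costs ~80 flop-times and one message
    # ~10^6 flop-times, which is why minimizing messages matters so much.
    n = 1000
    print(node_time(2 * n**3, 0, 0))   # compute an n x n matmul's flops
    print(node_time(0, n * n, 1))      # vs. communicate one n x n block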

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heatmap of predicted speedup over the plane log2(p) × log2(n²/p) = log2(memory_per_proc); up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU
        [Figure: T as a band matrix, zeros outside the band]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

Ordinary recursive LU (one layout throughout):

    func factor(A):
      if A has 1 column, update it, else
        factor(left half of A)
        update right half of A
        factor(right half of A)

  #Words = O(n³ / M^(1/2))
  #Messages = O(n³ / M)

Shape Morphing LU (switches layouts):

    func factor(A):
      if A has 1 column, update it, else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

  #Words = O(n³ / M^(1/2))
  #Messages = O(n³ / M^(3/2))

A runnable sketch of the column-recursive structure follows.

49
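A minimal sketch of the ordinary recursive variant above (plain recursive LU, no pivoting and no layout morphing); it assumes a matrix that needs no pivoting, e.g. one that is diagonally dominant:

    import numpy as np

    def recursive_lu(A):
        # In-place recursive LU without pivoting, halving by columns.
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]          # update a single column
            return A
        m = n // 2
        recursive_lu(A[:, :m])           # factor left half (a view of A)
        L11 = np.tril(A[:m, :m], -1) + np.eye(m)
        A[:m, m:] = np.linalg.solve(L11, A[:m, m:])   # update right half
        A[m:, m:] -= A[m:, :m] @ A[:m, m:]            # Schur complement
        recursive_lu(A[m:, m:])          # factor right half
        return A

    A = np.random.default_rng(0).standard_normal((8, 8)) + 8 * np.eye(8)
    LU = recursive_lu(A.copy())          # L (unit lower) and U packed in one array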

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes (a sketch of the column tournament follows the list)
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50
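A minimal sketch of the column tournament, using SciPy's column-pivoted QR as the "usual approach" selector at each node (Gu/Eisenstat would be a stronger selector); the sizes are illustrative:

    import numpy as np
    from scipy.linalg import qr

    def best_b_cols(A, b):
        # b candidate columns via one pass of column-pivoted QR
        _, _, perm = qr(A, mode='economic', pivoting=True)
        return perm[:b]

    def tournament_cols(A, b):
        # Local selections, then pairwise playoffs up a binary tree
        groups = np.array_split(np.arange(A.shape[1]), max(1, A.shape[1] // b))
        cands = [g[best_b_cols(A[:, g], min(b, g.size))] for g in groups]
        while len(cands) > 1:
            merged = [np.concatenate(cands[i:i + 2])
                      for i in range(0, len(cands), 2)]
            cands = [g[best_b_cols(A[:, g], b)] for g in merged]
        return cands[0]

    A = np.random.default_rng(0).standard_normal((100, 32))
    cols = tournament_cols(A, b=4)   # winning columns of the tournament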

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows):

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
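A runnable sketch of Kleene's algorithm in the (min,+) semiring, with min_plus playing the role of ⊗; the "=" updates that can shorten existing paths accumulate with min, as the semiring implies:

    import numpy as np

    def min_plus(A, B):
        # (min,+) "matmul": C[i,j] = min_k A[i,k] + B[k,j]
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    def dc_apsp(D):
        # Kleene's divide-and-conquer APSP, operating on views of D in place
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12, D21, D22 = D[:m, :m], D[:m, m:], D[m:, :m], D[m:, m:]
        D11[:] = dc_apsp(D11)
        D12[:] = min_plus(D11, D12)
        D21[:] = min_plus(D21, D11)
        D22[:] = np.minimum(D22, min_plus(D21, D12))
        D22[:] = dc_apsp(D22)
        D21[:] = min_plus(D22, D21)
        D12[:] = min_plus(D12, D22)
        D11[:] = np.minimum(D11, min_plus(D12, D21))
        return D

    INF = np.inf
    D = np.array([[0, 3, INF, 7], [8, 0, 2, INF],
                  [5, INF, 0, 1], [2, INF, INF, 0]], dtype=float)
    print(dc_apsp(D.copy()))   # matches the triple-loop Floyd-Warshall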

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups of 6.2x overall and 2x for 2.5D over 2D.]

53

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; a new one?
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost (a small generator for this regime follows)

55
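The Erdos-Renyi regime is easy to generate for experiments; a sketch with SciPy, where d and n are arbitrary illustrative choices:

    import scipy.sparse as sp

    # Erdos-Renyi sparse matrices: each entry nonzero with prob d/n
    n, d = 10_000, 8
    A = sp.random(n, n, density=d / n, format='csr', random_state=0)
    B = sp.random(n, n, density=d / n, format='csr', random_state=1)
    C = A @ B
    # nnz(A) ~ nnz(B) ~ d*n; the multiply does ~d^2*n flops and C has
    # ~d^2*n nonzeros: almost no reuse, which drives the bound above.
    print(A.nnz, B.nnz, C.nnz)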

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors; Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea (a sketch of the banded eigensolver stage follows)
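The payoff of stopping at bandwidth b is that banded eigensolvers are cheap. A sketch of the Banded → Diagonal stage using LAPACK's banded storage via SciPy; the band matrix here is random, standing in for the output of the CA reduction:

    import numpy as np
    from scipy.linalg import eig_banded, eigh

    rng = np.random.default_rng(0)
    n, b = 400, 5
    B = np.zeros((n, n))
    for k in range(b + 1):               # random symmetric band matrix
        v = rng.standard_normal(n - k)
        B += np.diag(v, k) + (np.diag(v, -k) if k else 0.0)

    # Pack the band in LAPACK upper storage: bands[b-k, k:] = k-th superdiagonal
    bands = np.zeros((b + 1, n))
    for k in range(b + 1):
        bands[b - k, k:] = np.diag(B, k)

    w_band = eig_banded(bands, lower=False, eigvals_only=True)
    w_full = eigh(B, eigvals_only=True)
    print(np.max(np.abs(w_band - w_full)))   # agreement to roundoff level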

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Figure: animation frames 1–6 of successive band reduction on a symmetric band matrix of width b+1. Orthogonal sweeps Q1/Q1ᵀ, Q2/Q2ᵀ, …, Q5/Q5ᵀ each eliminate a c-column block of the band, leaving d+1 diagonals and creating a (d+c)×(d+c) bulge that subsequent sweeps chase down the band.]

Conventional vs CA - SBR

Conventional: touch all data 4 times.    Communication-Avoiding: touch all data once.

[Animations of both schemes.]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (#Words, #Messages) | Memory Hierarchy (#Words, #Messages)

BLAS-3:            [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky:          [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym Indefinite:    [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR:                [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD:     [BDD'11][BDK'13] | [BDD'11]
Non Sym Eig:       [BDD'11] | [BDD'11]
Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3 — Words (BW): [AGZ'94][MT'99][ScaLAPACK]; Messages (L): [C'69][vGW'97][SD'11]; saving factor: L: n/P^(1/2)
• Cholesky — [ScaLAPACK][T'99][SD'11]; saving factor: L: n/P^(1/2)
• Sym. Indefinite — Words: [BBDDDPSTY'13][ScaLAPACK]; Messages: [BBDDDPSTY'13]; saving factor: L: n/P^(1/2)
• LU — Words: [ScaLAPACK][GDX'11][T'99][SD'11]; Messages: [GDX'11][T'99][SD'11]; saving factor: L: n/P^(1/2)
• QR — Words: [ScaLAPACK][DGHL'12][T'99]; Messages: [DGHL'12][T'99]; saving factor: L: n/P^(1/2)
• Rank-Revealing QR — [BDD'11][DGGX'13]
• Sym. Eig & SVD — Words: [BDD'11][BDK'13][ScaLAPACK]; Messages: [BDD'11][BDK'13]; saving factor: L: n/P^(1/2)
• Non-Sym. Eig — [BDD'11] (words & messages); saving factor: BW: P^(1/2), L: n

Saving factors attained with extra memory: 2.5D, M = O(c·n²/P).

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data — optimal (see the matrix powers sketch below)
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages — optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                    75
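The key serial ingredient is the matrix powers kernel. Below is a minimal sketch for a 1D 3-point stencil (an assumption made for brevity): each block reads its k ghost values once, then computes x, Ax, …, A^k x locally with some redundant work, instead of re-reading the vector k times.

    import numpy as np

    def local_matrix_powers(x, lo, hi, k, stencil=(-1.0, 2.0, -1.0)):
        # Compute (A^j x)[lo:hi] for j = 0..k, where A is the 3-point
        # stencil, reading only x[lo-k : hi+k] -- one "communication"
        a, b, c = stencil
        glo, ghi = max(lo - k, 0), min(hi + k, len(x))
        v = x[glo:ghi].copy()                      # the only remote read
        out = [v[lo - glo : hi - glo].copy()]
        for j in range(1, k + 1):
            w = np.empty_like(v)
            w[1:-1] = a * v[:-2] + b * v[1:-1] + c * v[2:]
            w[0] = b * v[0] + c * v[1]             # exact only at global edges
            w[-1] = a * v[-2] + b * v[-1]
            # after j steps only entries within j of a non-global buffer edge
            # are polluted; the [lo:hi] slice stays exact because j <= k
            v = w
            out.append(v[lo - glo : hi - glo].copy())
        return out

    # Same results as k global SpMVs, but each block of x is read once:
    n, k = 32, 3
    x = np.random.rand(n)
    blocks = [local_matrix_powers(x, i, i + 8, k) for i in range(0, n, 8)]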

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                    77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #memory references

                                                                    78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile — Reference (unblocked) performance vs Best (4x2 blocking), measured in Mflops]

                                                                    79

Register Profile: Itanium 2

[Figure: performance of all register block sizes, from 190 Mflops (worst) to 1190 Mflops (best)]

                                                                    80

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms —
Power3 (17% of peak): 122 → 252 Mflops;
Power4 (16% of peak): 459 → 820 Mflops;
Itanium 1 (8% of peak): 107 → 247 Mflops;
Itanium 2 (33% of peak): 190 Mflops → 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                    82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                    83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                    84

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup! (see the sketch below)
  – Actual mflop rate 1.5² = 2.25x higher

                                                                    85
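A minimal sketch of the register-blocking idea (assumed simplifications: matrix dimensions divisible by the block size, blocks stored as small numpy arrays; a real implementation stores one flat value array and unrolls the r-by-c multiply):

    import numpy as np
    from scipy.sparse import random as sprand

    def to_bcsr(A, r, c):
        # Store A as dense r x c blocks, filling in explicit zeros;
        # fill ratio = (stored entries) / nnz(A)
        D = np.asarray(A.todense())
        rowptr, cols, blocks = [0], [], []
        for bi in range(0, D.shape[0], r):
            for bj in range(0, D.shape[1], c):
                blk = D[bi:bi+r, bj:bj+c]
                if blk.any():
                    blocks.append(blk.copy())
                    cols.append(bj)
            rowptr.append(len(blocks))
        return rowptr, cols, blocks

    def bcsr_spmv(rowptr, cols, blocks, x, r):
        y = np.zeros((len(rowptr) - 1) * r)
        for bi in range(len(rowptr) - 1):
            for k in range(rowptr[bi], rowptr[bi + 1]):
                # small dense multiply; this is the loop a tuned kernel unrolls
                y[bi*r:(bi+1)*r] += blocks[k] @ x[cols[k]:cols[k] + blocks[k].shape[1]]
        return y

    A = sprand(120, 120, density=0.05, format="csr")
    rowptr, cols, blocks = to_bcsr(A, 3, 3)
    x = np.random.rand(120)
    fill = sum(b.size for b in blocks) / A.nnz
    print(fill, np.linalg.norm(bcsr_spmv(rowptr, cols, blocks, x, 3) - A @ x))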

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                    86

                                                                    100x100 Submatrix Along Diagonal

87

                                                                    Post-RCM Reordering

                                                                    88

                                                                    Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                    90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning (a hedged tuning-loop sketch follows below)
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                    91
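A hedged sketch of the search such libraries run (this is not the OSKI API; it reuses to_bcsr/bcsr_spmv and the A, x from the earlier sketch): time the user's matrix at several register block sizes and keep the fastest.

    import itertools, time

    def pick_block_size(A, x, sizes=(1, 2, 3, 4, 6, 8), trials=3):
        best_t, best_rc = float("inf"), (1, 1)
        for r, c in itertools.product(sizes, sizes):
            if A.shape[0] % r or A.shape[1] % c:
                continue                          # a real tuner pads instead
            rowptr, cols, blocks = to_bcsr(A, r, c)
            t = float("inf")
            for _ in range(trials):               # keep the best of a few runs
                t0 = time.perf_counter()
                bcsr_spmv(rowptr, cols, blocks, x, r)
                t = min(t, time.perf_counter() - t0)
            if t < best_t:
                best_t, best_rc = t, (r, c)
        return best_rc    # e.g. 4x2 won on the Itanium 2 profile above

    print(pick_block_size(A, x))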

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                    93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

94

Example: CA-Conjugate Gradient

[Algorithm annotations: the SpMVs are replaced by one call to the CA Matrix Powers Kernel; a single global reduction computes the Gram matrix G.] Local computations within the inner loop require no communication (see the sketch below).
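A minimal sketch of the two communication-avoiding ingredients (monomial basis shown for simplicity; in CA-CG the basis would come from the matrix powers kernel, so the s products cost one communication, and the Gram matrix costs one reduction):

    import numpy as np

    def krylov_basis(matvec, v, s):
        # V = [v, Av, ..., A^s v]; with the matrix powers kernel this
        # costs one communication instead of s
        V = [v]
        for _ in range(s):
            V.append(matvec(V[-1]))
        return np.column_stack(V)

    def gram(V):
        # one global reduction; G holds every inner product the CA-CG
        # inner loop needs, replacing ~2s separate dot products
        return V.T @ V

    n, s = 100, 4
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson demo
    V = krylov_basis(lambda y: A @ y, np.random.rand(n), s)
    G = gram(V)   # the inner loop then updates short coefficient vectors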

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                    96

[Figure: convergence of CG vs CA-CG (monomial basis), down to machine precision — slower convergence due to roundoff, and loss of accuracy due to roundoff. At s = 16 the monomial basis is rank deficient and the method breaks down.]

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400

                                                                    97
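The breakdown is easy to reproduce; here is a small numpy experiment on the same model problem (normalized monomial basis; the exact condition numbers printed will vary with the random start):

    import numpy as np

    n = 30
    T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))  # 2D Poisson, 5-point
    v = np.random.rand(n*n)
    V = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))
        print(s, np.linalg.cond(np.column_stack(V)))
    # cond grows geometrically; near s = 16 it reaches ~1/macheps,
    # i.e. the monomial basis is numerically rank deficient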

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as sum of Sᵏ and low-rank matrices (see the sketch below)

Taxonomy (nonzero entries × indices):

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz))   CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz))   Graph Laplacian             Stencils
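A minimal sketch of why the S + U·D·Vᵀ representation matters: A and its powers can be applied to a vector without ever forming a dense matrix (sizes and names here are illustrative):

    import numpy as np
    import scipy.sparse as sp

    def apply_A(S, U, D, V, x):
        # (S + U D V^T) x in O(nnz(S) + n*rank) work
        return S @ x + U @ (D @ (V.T @ x))

    def powers(S, U, D, V, x, k):
        # [x, Ax, ..., A^k x]; expanding (S + UDV^T)^k shows each term is
        # S^j times a low-rank correction, so no dense n x n matrix appears
        out = [x]
        for _ in range(k):
            out.append(apply_A(S, U, D, V, out[-1]))
        return out

    n, r = 1000, 5
    S = sp.random(n, n, density=1e-3, format="csr")
    U, V = np.random.rand(n, r), np.random.rand(n, r)
    D = np.diag(np.random.rand(r))
    xs = powers(S, U, D, V, np.random.rand(n), k=4)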

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                    101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors and relative error for orthogonal vectors, as the thread count varies — results of the same magnitude but opposite signs; even the sign is not reproducible]

Vector size: 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

                                                                    103
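The effect above is ordinary floating-point nonassociativity; a few lines reproduce it without MKL (magnitudes will vary):

    import random

    random.seed(1)
    xs = [random.uniform(-1, 1) for _ in range(10**6)]
    s1 = sum(xs)                            # one summation order
    s2 = sum(xs[0::2]) + sum(xs[1::2])      # the order "2 threads" would use
    s3 = sum(sorted(xs))                    # yet another order
    print(s1 - s2, s1 - s3)                 # typically nonzero, O(n * macheps)
    # for orthogonal-ish data the true sum is ~0, so the *relative*
    # error -- even the sign -- is not reproducible, as in the table above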

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (violates 2. or 3.)
  – Use (very) high precision to get exact answer (violates 2.)
  – Prerounding technique (Nguyen, D.) — sketched below

                                                                    104
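A one-bin sketch of the prerounding idea (real implementations such as Nguyen and Demmel's use several bins to recover full accuracy; the boundary choice below is the simplest safe one, an assumption for illustration):

    import math

    def reproducible_sum(xs):
        # Round every summand to a common power-of-2 boundary sigma chosen
        # so that sums of the rounded values are exact; then any summation
        # order (any #processors) returns bitwise the same result
        m = max((abs(x) for x in xs), default=0.0)
        if m == 0.0:
            return 0.0
        sigma = 2.0 ** math.ceil(math.log2(len(xs) * m) + 1)
        q = [(x + sigma) - sigma for x in xs]  # x rounded to ulp(sigma)/2
        s = 0.0
        for v in q:       # partial sums stay exactly representable
            s += v
        return s          # error vs true sum: O(n * max|x| * macheps), one bin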

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                    Summary

Don't Communic…

                                                                    106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)!


Back to LU: Using similar idea for TSLU as TSQR — use reduction tree to do "Tournament Pivoting":

W(n x b) = [ W1 ; W2 ; W3 ; W4 ], with each Wi = Pi·Li·Ui.
Choose b pivot rows of each Wi; call them W1', W2', W3', W4'.

[ W1' ; W2' ] = P12·L12·U12 — choose b pivot rows, call them W12'.
[ W3' ; W4' ] = P34·L34·U34 — choose b pivot rows, call them W34'.

[ W12' ; W34' ] = P1234·L1234·U1234 — choose b pivot rows.

Go back to W and use these b pivot rows (move them to top, do LU without pivoting). A sketch follows below.

                                                                      37
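A minimal serial sketch of tournament pivoting (scipy's GEPP plays each "game"; for brevity it assumes the number of row chunks is a power of 2):

    import numpy as np
    import scipy.linalg as sla

    def best_rows(block, b):
        # GEPP on a tall block; its first b pivot rows win this game
        P, L, U = sla.lu(block)          # block = P @ L @ U
        return (P.T @ block)[:b]

    def tournament_pivot_rows(W, b):
        # leaves: candidates from each 2b-row chunk of W
        cands = [best_rows(W[i:i+2*b], b) for i in range(0, W.shape[0], 2*b)]
        while len(cands) > 1:            # binary reduction tree
            cands = [best_rows(np.vstack(pair), b)
                     for pair in zip(cands[0::2], cands[1::2])]
        return cands[0]  # move these b rows to the top, then LU w/o pivoting

    W = np.random.randn(64, 4)
    print(tournament_pivot_rows(W, 4))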

Minimizing Communication in TSLU

Parallel (binary tree): W = [W1; W2; W3; W4] — LU on each block in parallel, then LU on pairs of winners, then one final LU at the root.

Sequential/Streaming (flat tree): W = [W1; W2; W3; W4] — LU on W1, then fold in W2, W3, W4 one at a time.

Dual Core (hybrid tree): a mix of the two shapes.

Can choose reduction tree dynamically to match architecture, as before.

                                                                      38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

                                                                      39

                                                                      Stability of LU using TSLU CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct — in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L − Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
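The Matlab experiment translates directly to numpy/scipy; a hedged rerun (seed and counts arbitrary):

    import numpy as np
    import scipy.linalg as sla

    def lu_nopivot_L(A):
        # Gaussian elimination without pivoting; returns L (may hit ~0 pivots)
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        with np.errstate(divide="ignore", invalid="ignore"):
            for k in range(n - 1):
                L[k+1:, k] = A[k+1:, k] / A[k, k]
                A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
        return L

    rng = np.random.default_rng(0)
    for _ in range(100):
        A = rng.random((6, 3)) @ rng.random((3, 6))  # random 6x6, rank 3
        P, L, U = sla.lu(A)                          # GEPP: A = P L U
        Lnp = lu_nopivot_L(P.T @ A)                  # same row order, no pivoting
        print(np.linalg.norm(L - Lnp))               # 0s, infs, NaNs, and O(1)s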

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

                                                                      42

                                                                      2D CALU with Tournament Pivoting

                                                                      43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                                                      44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup over the plane log2(p) (x-axis) vs log2(n²/p) = log2(memory_per_proc) (y-axis); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column and along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·Pᵀ = L·T·Lᵀ with T banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

[Figure: zero pattern of the banded matrix T]

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

49

• Simple recursion (columnwise layout only):

    func factor(A):
      if A has 1 column, update it, else
        factor(left half of A)
        update right half of A
        factor(right half of A)

  – #Words = O(n³/M^(1/2))
  – #Messages = O(n³/M)

• SMLU (switch layouts inside the recursion):

    func factor(A):
      if A has 1 column, update it, else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

  – #Words = O(n³/M^(1/2))
  – #Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A — Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D — need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows below):

52

    D = DC-APSP(A, n):
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
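A runnable serial sketch of the recursion above (min-plus product with in-place min update as the ⊗; the O(n³) broadcast temporary is fine for a demo, and n is assumed a power of 2):

    import numpy as np

    def semiring_update(D, A, B):
        # D(i,j) <- min( D(i,j), min_k A(i,k) + B(k,j) )
        np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1), out=D)

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12 = D[:m, :m], D[:m, m:]
        D21, D22 = D[m:, :m], D[m:, m:]
        dc_apsp(D11)
        semiring_update(D12, D11, D12)
        semiring_update(D21, D21, D11)
        semiring_update(D22, D21, D12)
        dc_apsp(D22)
        semiring_update(D21, D22, D21)
        semiring_update(D12, D12, D22)
        semiring_update(D11, D12, D21)
        return D

    n = 8
    G = np.where(np.random.rand(n, n) < 0.4, np.random.rand(n, n), np.inf)
    np.fill_diagonal(G, 0.0)      # edge weights, inf = no edge
    D = dc_apsp(G.copy())         # all-pairs shortest path distances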

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores); annotated speedups of 6.2x and 2x]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid; w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

      #Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )

  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach (conventional pipeline sketched below)
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth ≈ M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea
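For contrast, a minimal sketch of the conventional pipeline using scipy (an illustration only: scipy exposes no banded-reduction routine, so the CA dense → banded → tridiagonal path is described above but not coded here; for symmetric A, Hessenberg reduction yields the same tridiagonal T that sytrd would produce):

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
A = (A + A.T) / 2                        # symmetric test matrix

# Stage "dense -> tridiagonal": for symmetric A the Hessenberg form is tridiagonal
T, Q = hessenberg(A, calc_q=True)
d, e = np.diag(T).copy(), np.diag(T, 1).copy()

# Stage "tridiagonal -> diagonal"
evals = eigh_tridiagonal(d, e, eigvals_only=True)
print(np.allclose(np.sort(evals), np.sort(np.linalg.eigvalsh(A))))
```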

[Figure: Successive Band Reduction (Bischof/Lang/Sun), shown as a sequence of animation frames. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Each sweep applies orthogonal transforms Q1, Q1ᵀ, Q2, Q2ᵀ, …, Q5, Q5ᵀ to annihilate c columns/rows at a time, creating a bulge of width d+c that is chased down the band in numbered steps 1–6.]

Conventional vs. CA-SBR
  – Conventional: touch all data 4 times
  – Communication-avoiding: touch all data once
[Side-by-side animations of the two band reductions omitted]

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer (sketch below)
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
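A simplified sketch of one spectral split via the matrix sign function (assumptions: a user-chosen shift stands in for the randomized spectrum split, deterministic pivoted QR stands in for randomized RRQR, and no eigenvalue lies too close to the splitting line):

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, iters=40):
    """One step of spectral divide-and-conquer (illustrative, not CA)."""
    n = A.shape[0]
    X = A - shift * np.eye(n)
    for _ in range(iters):                  # Newton: X <- (X + X^-1)/2
        X = 0.5 * (X + np.linalg.inv(X))    # converges to sign(A - shift*I)
    P = 0.5 * (X + np.eye(n))               # projector for Re(lambda) > shift
    Q, R, _ = qr(P, pivoting=True)          # leading cols of Q span the subspace
    k = int((np.abs(np.diag(R)) > 1e-8 * np.abs(R[0, 0])).sum())
    B = Q.T @ A @ Q                         # block upper triangular: B[k:, :k] small
    return Q, k, B

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
Q, k, B = split_spectrum(A)
print(k, np.linalg.norm(B[k:, :k]))         # subspace dim, residual (small)
```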

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(words and messages, for two-level memory and full memory hierarchy)

• BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.]
• Cholesky: [G'97] [AP'00] [LAPACK] [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: [G'97] [T'97] [GDX'11] [BDLST'13]
• QR: [EG'98] [FW'03] [DGHL'12] [BDLST'13]
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD: [BDD'11] [BDK'13]
• Non-Sym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; latency saving factor: n/P^(1/2)
• Cholesky: [ScaLAPACK] [T'99] [SD'11]; latency saving: n/P^(1/2)
• Sym. Indefinite: words [BBDDDPSTY'13] [ScaLAPACK]; messages [BBDDDPSTY'13]; latency saving: n/P^(1/2)
• LU: words [ScaLAPACK] [GDX'11] [T'99] [SD'11]; messages [GDX'11] [T'99] [SD'11]; latency saving: n/P^(1/2)
• QR: words [ScaLAPACK] [DGHL'12] [T'99]; messages [DGHL'12] [T'99]; latency saving: n/P^(1/2)
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD: words [BDD'11] [BDK'13] [ScaLAPACK]; messages [BDD'11] [BDK'13]; latency saving: n/P^(1/2)
• Non-Sym. Eig: [BDD'11]; saving factors: words P^(1/2), messages n
• Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
(a naive matrix-powers sketch follows)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Figure: SpMV register-blocking profile on Itanium 2, comparing the reference implementation against the best block size (4x2), rates in Mflops]

Register Profile: Itanium 2

[Figure: performance across all register block sizes, ranging from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms. Power3: 122 to 252 Mflops (17% of peak); Power4: 459 to 820 Mflops (16%); Itanium 1: 107 to 247 Mflops (8%); Itanium 2: 190 Mflops to 1.2 Gflops (33%). Each panel's scale runs from the reference rate to the best blocked rate.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate 1.5² = 2.25x higher
(a block-storage sketch follows)
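A minimal sketch of measuring the fill ratio when forcing block storage, using scipy's BSR format on a random test matrix (the matrix and sizes are placeholders, not the raefsky or ex11 data):

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(900, 900, density=0.01, format="csr", random_state=0)
B = A.tobsr(blocksize=(3, 3))     # pads each occupied 3x3 block with explicit zeros

# Stored entries (incl. explicit zeros) / true nonzeros. Blocking pays off when
# the Mflop-rate gain from unrolled 3x3 multiplies exceeds this ratio, as in the
# slide's example: fill ratio 1.5, blocked rate 1.5^2 = 2.25x, net speedup 1.5x.
# (This random pattern has a large fill ratio, so here blocking would not pay.)
print("fill ratio:", B.nnz / A.nnz)

x = np.ones(900)
assert np.allclose(A @ x, B @ x)  # same operator, different storage
```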


Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: 100x100 submatrix along the diagonal]

Post-RCM Reordering

Effect of Combined RCM+TSP Reordering
  – Before: Green + Red
  – After: Green + Blue
• 2x speedups on Pentium 4, Power 4, … (an RCM sketch follows)
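A sketch of the RCM half of the reordering (scipy has no TSP-based refinement, and the matrix here is a scrambled banded placeholder, not the cavity problem):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Scramble a banded symmetric matrix, then recover locality with RCM.
n = 2500
A = sp.diags([-1, -1, 4, -1, -1], [-50, -1, 0, 1, 50], shape=(n, n), format="csr")
p = np.random.default_rng(0).permutation(n)
B = A[p][:, p]                                  # scrambled ordering
perm = reverse_cuthill_mckee(B, symmetric_mode=True)
C = B[perm][:, perm]                            # RCM ordering

def bandwidth(M):
    co = M.tocoo()
    return int(np.abs(co.row - co.col).max())

print("scrambled:", bandwidth(B), " after RCM:", bandwidth(C))
```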

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Example: Classical Conjugate Gradient (CG)
  – SpMVs and dot products require communication in each iteration (see the sketch below)

Example: CA-Conjugate Gradient
  – s-step basis computed via the CA matrix powers kernel
  – One global reduction computes the Gram matrix G
  – Local computations within the inner loop require no communication
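A minimal CG sketch with the communication points marked in comments (the tridiagonal test matrix is a placeholder; in the CA version, s steps share one matrix-powers call and one global reduction):

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxit=5000):
    """Textbook CG; each iteration needs one SpMV (neighbor communication)
    and two dot products (global reductions)."""
    x = np.zeros_like(b)
    r = b.copy(); p = r.copy()
    rr = r @ r                      # dot product -> global reduction
    for _ in range(maxit):
        Ap = A @ p                  # SpMV -> neighbor communication
        alpha = rr / (p @ Ap)       # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r              # dot product -> global reduction
        if np.sqrt(rr_new) < tol * np.linalg.norm(b):
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 900
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))
```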

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


[Figure: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence due to roundoff and loss of accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down. A conditioning demo follows.]
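The rank deficiency in the figure is easy to reproduce; a sketch on the same model problem, using a normalized monomial basis:

```python
import numpy as np
import scipy.sparse as sp

def poisson2d(n):
    # 2D Poisson, 5-point stencil, n x n grid (the model problem above)
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)
rng = np.random.default_rng(0)
x = rng.standard_normal(A.shape[0])
for s in (4, 8, 16):
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = x / np.linalg.norm(x)
    for j in range(s):                 # normalized monomial basis vectors
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)
    sv = np.linalg.svd(V, compute_uv=False)
    print(f"s = {s:2d}: cond(V) = {sv[0] / sv[-1]:.2e}")   # blows up by s = 16
```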

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):    CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz)):    Graph Laplacian              Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square (sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as sum of Sᵏ and low-rank matrices
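A sketch of using such a representation without ever forming A densely; S, U, D, V are the slide's notation, filled with random placeholder data:

```python
import numpy as np
import scipy.sparse as sp

n, r = 5000, 5
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))
V = rng.standard_normal((n, r))

def apply_A(x):
    # A = S + U D V^T applied in O(nnz(S) + n*r) work, vs O(n^2) if formed
    return S @ x + U @ (D @ (V.T @ x))

def apply_Ak(x, k):
    # A^k x by repeated application; expanding (S + U D V^T)^k instead
    # yields S^k plus low-rank terms, which CA methods exploit
    for _ in range(k):
        x = apply_A(x)
    return x

x = rng.standard_normal(n)
print(np.linalg.norm(apply_Ak(x, 3)))
```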

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible)]

• Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (does not meet goals 2 or 3)
  – Use (very) high precision to get exact answer (does not meet goal 2)
  – Prerounding technique (Nguyen, D.); see the demo below
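A toy demonstration of the problem and of the "exact answer" approach, with Python's math.fsum standing in for the high-precision accumulators used in reproducible BLAS libraries (fsum is correctly rounded, hence order-independent):

```python
import math, random

random.seed(0)
data = [random.gauss(0, 1) * 10.0 ** random.randint(-10, 10) for _ in range(10**5)]

sums, exact = set(), set()
for trial in range(5):
    random.shuffle(data)            # a different reduction order each time
    sums.add(sum(data))             # ordinary left-to-right fp summation
    exact.add(math.fsum(data))      # exactly rounded -> order-independent

print("distinct plain-sum results:", len(sums))    # typically > 1
print("distinct fsum results:     ", len(exact))   # always 1
```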


Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Minimizing Communication in TSLU

[Figure: three reduction trees for TSLU on W = [W1; W2; W3; W4].
  Parallel (binary tree): LU of each Wi in parallel; LU of each pair of stacked results; LU at the root.
  Sequential/Streaming (flat tree): LU of W1; fold in W2, W3, W4 one LU at a time.
  Dual Core: a hybrid of the two tree shapes.]

Can choose reduction tree dynamically to match architecture, as before (a small sketch follows).

38
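To make the tournament concrete, here is a minimal NumPy/SciPy sketch of binary-tree tournament pivoting for a tall-skinny b-column panel. It is sequential and illustrative only: in the real algorithm each leaf block lives on its own processor and only the b winning rows move up the tree.

    import numpy as np
    from scipy.linalg import lu

    def best_rows(W, b):
        """One play-off: GEPP on the stacked candidates; the first b rows
        of P^T W are the pivot rows it chose."""
        P, L, U = lu(W)              # W = P @ L @ U
        return (P.T @ W)[:b]

    def tslu_pivot_rows(blocks, b):
        """Binary-tree tournament pivoting: local eliminations at the
        leaves, then pairwise play-offs until b rows remain."""
        blocks = [best_rows(B, b) for B in blocks]
        while len(blocks) > 1:
            blocks = [best_rows(np.vstack(blocks[i:i + 2]), b)
                      for i in range(0, len(blocks), 2)]
        return blocks[0]

    # Usage: a 4096 x 8 panel split into 4 local blocks (one per "processor").
    W = np.random.randn(4096, 8)
    winners = tslu_pivot_rows(np.split(W, 4), b=8)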

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

40

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment (replicated in the sketch below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A and compare the L factors: are they the same?
• Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
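A small Python analogue of the Matlab experiment above; building the rank-3 test matrices as a product of random factors is an assumption about how they were generated, made here for illustration.

    import numpy as np
    from scipy.linalg import lu

    def lu_no_pivot(A):
        """Doolittle LU without row exchanges."""
        n = A.shape[0]
        L, U = np.eye(n), A.astype(float).copy()
        for k in range(n - 1):
            L[k+1:, k] = U[k+1:, k] / U[k, k]
            U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])
        return L, np.triu(U)

    rng = np.random.default_rng(0)
    with np.errstate(all='ignore'):              # expect 0s, infs, NaNs
        for _ in range(100):
            A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))
            P, L, U = lu(A)                      # GEPP: A = P @ L @ U
            Lnp, _ = lu_no_pivot(P.T @ A)        # same pivot order, no pivoting
            print(np.linalg.norm(L - Lnp))       # mostly O(1), per the slide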

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare); see the sketch below
• Test conditioning of U: if not tiny (usual case), proceed, else
• Compute || L ||: if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42
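A minimal sketch of the test-and-fix strategy, using a plain LU as a stand-in for the TSLU result; the two thresholds are assumptions chosen for illustration.

    import numpy as np
    from scipy.linalg import lu, qr

    def checked_panel_lu(A, cond_max=1e12, lnorm_max=1e3):
        """Accept the cheap factorization when U is well conditioned and
        L is modest; otherwise fall back to QR (the TSQR role) then LU of Q."""
        P, L, U = lu(A)                       # stand-in for the TSLU result
        if np.linalg.cond(U) < cond_max and np.linalg.norm(L, np.inf) < lnorm_max:
            return P, L, U                    # usual case: keep it
        Q, R = qr(A, mode='economic')         # rare case: A = Q @ R
        Pq, Lq, Uq = lu(Q)                    # then Q = Pq @ Lq @ Uq
        return Pq, Lq, Uq @ R                 # A = Pq @ Lq @ (Uq @ R)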

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

A small model calculation with these numbers follows.
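To make the numbers concrete, here is a tiny sketch that plugs them into the three-term running-time model (flops, words, messages). The per-node flop rate below is an assumption, for illustration only.

    # time = #flops * time_per_flop + #words / bandwidth + #messages * latency
    FLOP_RATE = 1e12          # assumed: 1 Tflop/s per node
    BANDWIDTH = 100e9 / 8     # 100 GB/s interconnect, in 8-byte words/s
    LATENCY   = 1e-6          # 1 microsecond per message

    def node_time(flops, words, messages):
        return (flops / FLOP_RATE,       # compute term
                words / BANDWIDTH,       # bandwidth term
                messages * LATENCY)      # latency term

    # Example: 10^9 flops, 10^6 words, 10^3 messages on one node.
    terms = node_time(1e9, 1e6, 1e3)
    print(terms, "total:", sum(terms))   # the latency term alone is 1 ms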

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: speedup contour plot; x-axis: log2(P), y-axis: log2(n^2/P) = log2(memory_per_proc). Up to 29x predicted speedup.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

[Figure: banded matrix T, nonzeros clustered near the diagonal.]

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

Columnwise layout only:

    func factor(A)
      if A has 1 column
        update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)

• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M)

Shape Morphing LU (switch layouts):

    func factor(A)
      if A has 1 column
        update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting (see the sketch below)
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
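A minimal sketch of one way to run the column tournament with off-the-shelf pieces, using QR with column pivoting as the local selector; the actual algorithms may use strong RRQR (Gu/Eisenstat) instead, and the groups would live on different processors.

    import numpy as np
    from scipy.linalg import qr

    def best_cols(A, b):
        """Select up to b column indices of A via QR with column pivoting."""
        _, _, piv = qr(A, mode='economic', pivoting=True)
        return piv[:b]

    def tournament_cols(A, b):
        """Binary-tree tournament over groups of b columns; winners play winners."""
        groups = [np.arange(j, min(j + b, A.shape[1]))
                  for j in range(0, A.shape[1], b)]
        groups = [g[best_cols(A[:, g], b)] for g in groups]   # local rounds
        while len(groups) > 1:
            nxt = []
            for i in range(0, len(groups), 2):
                g = np.concatenate(groups[i:i + 2])
                nxt.append(g[best_cols(A[:, g], b)])
            groups = nxt
        return groups[0]    # indices of the b "best" columns of A

    A = np.random.randn(100, 64)
    print(tournament_cols(A, b=8))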

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B, i.e., matmul over the (min,+) semiring, accumulating with min into the left-hand side
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable check follows):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11,D12],[D21,D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
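A small NumPy sketch of the divide-and-conquer recursion over the (min,+) semiring, checked against the triple loop; it is sequential and assumes n is a power of two, just to show the block structure that the 2.5D algorithm distributes.

    import numpy as np

    def minplus(A, B):
        """(min,+) 'matmul': C[i,j] = min_k A[i,k] + B[k,j]."""
        return (A[:, :, None] + B[None, :, :]).min(axis=1)

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D = D.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = np.minimum(D[:h, h:], minplus(D[:h, :h], D[:h, h:]))
        D[h:, :h] = np.minimum(D[h:, :h], minplus(D[h:, :h], D[:h, :h]))
        D[h:, h:] = np.minimum(D[h:, h:], minplus(D[h:, :h], D[:h, h:]))
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = np.minimum(D[h:, :h], minplus(D[h:, h:], D[h:, :h]))
        D[:h, h:] = np.minimum(D[:h, h:], minplus(D[:h, h:], D[h:, h:]))
        D[:h, :h] = np.minimum(D[:h, :h], minplus(D[:h, h:], D[h:, :h]))
        return D

    # Check against triple-loop Floyd-Warshall on a random weighted graph.
    n = 8
    rng = np.random.default_rng(1)
    A = rng.uniform(1, 10, (n, n)); np.fill_diagonal(A, 0)
    FW = A.copy()
    for k in range(n):
        FW = np.minimum(FW, FW[:, [k]] + FW[[k], :])
    print(np.allclose(dc_apsp(A), FW))   # True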

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d^2·n / (P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: animation of one sweep of successive band reduction on a symmetric band matrix. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. In each step a block of c columns (depth d+1) is eliminated by an orthogonal transform applied from both sides (Q1, Q1^T, ..., Q5, Q5^T); each elimination creates a bulge of size (d+c) x (d+c) further down the band, which is chased off the end by the subsequent transforms (regions labeled 1 through 6).]

Conventional vs CA - SBR

    Conventional              Communication-Avoiding
    Touch all data 4 times    Touch all data once

[Figure: sweep patterns for the two approaches.]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        Q^T·A·Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(columns: #Words and #Messages, for a Two-Level memory and for a full Memory Hierarchy)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] (all columns)
• Cholesky: words [G'97][AP'00][LAPACK][BDHS'09]; messages [G'97][AP'00][BDHS'09]; hierarchy (words & messages) [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] (all columns)
• LU: words [G'97][T'97][GDX'11][BDLST'13]; messages [GDX'11][BDLST'13]; hierarchy words [G'97][T'97][BDLST'13]; hierarchy messages [BDLST'13]
• QR: words [EG'98][FW'03][DGHL'12][BDLST'13]; messages [FW'03][DGHL'12][BDLST'13]; hierarchy words [EG'98][FW'03][BDLST'13]; hierarchy messages [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] (two levels); [BDD'11] (hierarchy)
• Non-Sym. Eig: [BDD'11] (words & messages)

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: words (BW) [AGZ'94][MT'99][ScaLAPACK]; messages (L) [C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: words & messages [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym. Indefinite: words [BBDDDPSTY'13][ScaLAPACK]; messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: words [ScaLAPACK][GDX'11][T'99][SD'11]; messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: words [ScaLAPACK][DGHL'12][T'99]; messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: words [BDD'11][BDK'13][ScaLAPACK]; messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] (words & messages); saving factors BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation (a matrix-powers sketch follows)
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
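The serial O(1)-data-movement claim rests on the "matrix powers kernel": read each piece of A and x once and compute its contributions to x, Ax, ..., A^k x before moving on, at the price of redundant work near block boundaries. A minimal sketch for a 1D 3-point stencil, where each block needs k ghost entries per side; block size, k, and the stencil are illustrative assumptions.

    import numpy as np

    def matrix_powers_1d(x, k, blk):
        """[x, Ax, ..., A^k x] for (A x)[i] = -x[i-1] + 2 x[i] - x[i+1],
        processing one block of x at a time with k ghost points per side."""
        n = len(x)
        V = np.zeros((k + 1, n))
        for lo in range(0, n, blk):
            hi = min(lo + blk, n)
            glo, ghi = max(lo - k, 0), min(hi + k, n)   # ghost-extended block
            w = x[glo:ghi].copy()
            V[0, lo:hi] = w[lo - glo:hi - glo]
            for j in range(1, k + 1):
                w2 = 2 * w
                w2[1:] -= w[:-1]
                w2[:-1] -= w[1:]
                # the outermost ghost entries are now wrong; the valid region
                # shrinks inward by one per step, which is why k ghosts suffice
                w = w2
                V[j, lo:hi] = w[lo - glo:hi - glo]
        return V

    # Check against explicit powers of the dense tridiagonal matrix.
    n = 64
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x = np.random.randn(n)
    V = matrix_powers_1d(x, k=3, blk=16)
    ref = np.vstack([np.linalg.matrix_power(A, j) @ x for j in range(4)])
    print(np.allclose(V, ref))   # True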

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heat map in Mflops; reference implementation vs best blocking (4x2).]

79

Register Profile: Itanium 2

[Figure: performance of all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64

[Figure: register-profile heat maps for four platforms. Power3 – 17% of peak (122 to 252 Mflops); Power4 – 16% (459 to 820 Mflops); Itanium 1 – 8% (107 to 247 Mflops); Itanium 2 – 33% (190 Mflops to 1.2 Gflops).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking (see the sketch below)
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher

85
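A minimal NumPy/SciPy sketch of the idea: convert to Block CSR with 3x3 blocks (the explicit zeros are filled in automatically), so the multiply runs over dense 3x3 blocks; the fill ratio is the price paid in extra flops and storage. The matrix here is random, for illustration only.

    import numpy as np
    import scipy.sparse as sp

    # A small sparse matrix; real matrices like raefsky have natural blocks.
    A = sp.random(90, 90, density=0.05, format='csr', random_state=0)

    B = A.tobsr(blocksize=(3, 3))      # BCSR: stores full 3x3 blocks,
    fill_ratio = B.nnz / A.nnz         # padding with explicit zeros
    print("fill ratio:", fill_ratio)

    x = np.ones(90)
    print(np.allclose(A @ x, B @ x))   # same result, different inner loop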

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                        93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration (see the sketch below).
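The slide's equations were images; as a stand-in, here is textbook CG in NumPy with the communicating operations marked: one SpMV and two dot products (global reductions) per iteration. The matrix and tolerances are illustrative.

    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        x = np.zeros_like(b)
        r = b.copy()                 # r = b - A @ x, with x = 0
        p = r.copy()
        rr = r @ r                   # dot product -> global reduction
        for _ in range(maxit):
            Ap = A @ p               # SpMV -> neighbor communication
            alpha = rr / (p @ Ap)    # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r           # dot product -> global reduction
            if np.sqrt(rr_new) < tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # Usage: 1D Poisson matrix.
    n = 100
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))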

94

Example: CA-Conjugate Gradient

• s-step bases computed via the CA Matrix Powers Kernel
• Global reduction to compute G
• Local computations within the inner loop require no communication (see the sketch below)
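A small sketch of the enabling trick, assuming a monomial Krylov basis: build V = [p, Ap, ..., A^s p] with the matrix powers kernel, form the Gram matrix G = V^T V in one reduction, and then every dot product of vectors carried in the basis' coordinates becomes a local computation with G.

    import numpy as np

    n, s = 100, 4
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    p = np.random.randn(n)

    # Matrix powers kernel output: V = [p, Ap, ..., A^s p]
    V = np.empty((n, s + 1))
    V[:, 0] = p
    for j in range(1, s + 1):
        V[:, j] = A @ V[:, j - 1]

    G = V.T @ V    # one global reduction of a small (s+1) x (s+1) matrix

    # Any vectors carried as coordinates a, b in the basis V:
    a = np.random.randn(s + 1)
    b = np.random.randn(s + 1)
    print(np.isclose((V @ a) @ (V @ b), a @ G @ b))   # True: dots are local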

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                        96

[Figure: convergence of CA-CG (monomial basis) vs CG. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, leveling off well above machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Storage options for the nonzero entries and their indices:

                        Nonzero entries:
  Indices:              Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))     CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))     Graph Laplacian        Stencils
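As a concrete instance of the explicit-indices/explicit-entries corner of the table, here is a minimal CSR sketch in Python (illustrative only; libraries such as SciPy provide tuned versions): O(nnz) values plus O(nnz + n) indices, and a y = A·x that touches only stored nonzeros.

  import numpy as np

  def csr_spmv(rowptr, colind, vals, x):
      # y = A @ x, touching only the stored nonzeros
      n = len(rowptr) - 1
      y = np.zeros(n)
      for i in range(n):
          for k in range(rowptr[i], rowptr[i + 1]):
              y[i] += vals[k] * x[colind[k]]
      return y

  # A = [[4, 0, 1],
  #      [0, 3, 0],
  #      [2, 0, 5]]
  rowptr = np.array([0, 2, 3, 5])      # row i occupies vals[rowptr[i]:rowptr[i+1]]
  colind = np.array([0, 2, 1, 0, 2])   # column index of each stored entry
  vals   = np.array([4., 1., 3., 2., 5.])
  print(csr_spmv(rowptr, colind, vals, np.array([1., 1., 1.])))   # [5. 3. 7.]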

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), who otherwise don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – A few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors; relative error for orthogonal vectors. Errors have the same magnitude but opposite signs, so even the sign is not reproducible.]
• Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum − minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (not goals 2 or 3; see the sketch below)
  – Use (very) high precision to get the exact answer (not goal 2)
  – Prerounding technique (Nguyen, D.)
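A small Python illustration of the underlying problem and of the first approach (a sketch under simplifying assumptions, not the prerounding algorithm): floating-point addition is nonassociative, so different summation orders give different rounding errors, while a fixed reduction tree pins down one order, and hence one answer.

  import numpy as np

  rng = np.random.default_rng(1)
  x = rng.standard_normal(10**5)

  # two different summation orders: typically differ in the last bits
  print(sum(x.tolist()) - float(np.sum(x)))

  def fixed_tree_sum(v):
      # deterministic binary reduction tree: one answer for a given n,
      # no matter how the leaves are assigned to threads or processors
      v = [float(t) for t in v]
      while len(v) > 1:
          if len(v) % 2:
              v.append(0.0)
          v = [v[i] + v[i + 1] for i in range(0, len(v), 2)]
      return v[0]

  print(fixed_tree_sum(x))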

Performance results on 1024 processors of a Cray XC30
• 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).


Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?
• The proof is correct – in exact arithmetic
• Experiment (rendered in code below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A; compare the L factors: are they the same?
• Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: ||L − Lnp|| often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
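A NumPy/SciPy rendition of that experiment (a sketch: scipy's lu plays the role of Matlab's, and lu_nopivot is a plain textbook elimination, not anyone's library routine):

  import numpy as np
  from scipy.linalg import lu

  def lu_nopivot(A):
      # textbook Gaussian elimination with no pivoting; returns the unit lower L
      A = np.array(A, dtype=float)
      n = A.shape[0]
      L = np.eye(n)
      for k in range(n - 1):
          L[k+1:, k] = A[k+1:, k] / A[k, k]
          A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
      return L

  rng = np.random.default_rng(0)
  diffs = []
  with np.errstate(all='ignore'):                # tiny pivots produce infs/NaNs
      for _ in range(100):
          A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
          P, L, U = lu(A)                        # scipy: A = P @ L @ U
          Lnp = lu_nopivot(P.T @ A)              # same pivot order, applied up front
          diffs.append(np.linalg.norm(L - Lnp))

  d = np.array(diffs)
  print("zeros:", int((d == 0).sum()),
        "infs:", int(np.isinf(d).sum()),
        "NaNs:", int(np.isnan(d).sum()))
  print("typical size of the rest:", np.median(d[np.isfinite(d) & (d > 0)]))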

Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare); see the sketch below
  – Test the conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
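The control flow, as a hedged Python sketch (the function name, the threshold, and the use of dense scipy lu/qr in place of TSLU/TSQR are all illustrative, not the paper's implementation):

  import numpy as np
  from scipy.linalg import lu, qr

  def safe_lu(A, tol=1e8):
      P, L, U = lu(A)                        # stand-in for fast TSLU
      if np.linalg.cond(U) < tol and np.linalg.norm(L) < tol:
          return P, L, U                     # usual case: accept
      Q, R = qr(A)                           # rare case: A = Q R   (think TSQR)
      P2, L2, U2 = lu(Q)                     # Q = P2 L2 U2         (think TSLU)
      return P2, L2, U2 @ R                  # A = P2 L2 (U2 R), U2 R upper triangular

  A = np.random.default_rng(0).standard_normal((8, 8))
  P, L, U = safe_lu(A)
  assert np.allclose(P @ L @ U, A)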

2D CALU with Tournament Pivoting
[figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[figure]

Exascale Machine Parameters (source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot over log2(p) (x-axis) and log2(n^2/p) = log2(memory_per_proc) (y-axis); predicted speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting
[figure]

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves 1/2 the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down the column and along the row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·P^T = L·T·L^T with T banded, computed using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

Recursive GEPP (column layout):

  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)

  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M)

Shape Morphing LU (a runnable skeleton of the recursion follows below):

  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)

  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M^(3/2))
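For concreteness, a minimal NumPy rendition of the recursive control structure (assumptions: no pivoting and no layout morphing, so this shows only the "factor left half, update right half, factor right half" skeleton, not SMLU itself):

  import numpy as np

  def factor(A):
      # in-place recursive LU (no pivoting): A is overwritten by L\U
      n = A.shape[1]
      if n == 1:
          A[1:, 0] /= A[0, 0]                          # "update it"
          return
      k = n // 2
      factor(A[:, :k])                                 # factor(left half of A)
      L11 = np.tril(A[:k, :k], -1) + np.eye(k)
      A[:k, k:] = np.linalg.solve(L11, A[:k, k:])      # update right half of A
      A[k:, k:] -= A[k:, :k] @ A[:k, k:]
      factor(A[k:, k:])                                # factor(right half of A)

  rng = np.random.default_rng(0)
  A = rng.standard_normal((6, 6)) + 6 * np.eye(6)      # safe without pivoting
  LU = A.copy()
  factor(LU)
  L = np.tril(LU, -1) + np.eye(6)
  U = np.triu(LU)
  assert np.allclose(L @ U, A)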

Other CA algorithms for Ax=b, least squares (3/3)
• The need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (sketched in code below):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
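One round and a full reduction tree, as a hedged SciPy sketch (QR with column pivoting stands in for the local RRQR; the test matrix, group size b, and reduction shape are illustrative):

  import numpy as np
  from scipy.linalg import qr

  def tournament_round(A, cols1, cols2, b):
      # from two groups of b candidate columns, keep the b that QR with
      # column pivoting ranks highest on the combined 2b-column block
      cand = np.concatenate([cols1, cols2])
      _, _, piv = qr(A[:, cand], mode='economic', pivoting=True)
      return cand[piv[:b]]

  rng = np.random.default_rng(0)
  A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 32))   # rank ~8
  b = 4
  groups = [np.arange(i, i + b) for i in range(0, 32, b)]
  while len(groups) > 1:            # one reduction-tree level per iteration
      groups = [tournament_round(A, groups[i], groups[i + 1], b)
                for i in range(0, len(groups), 2)]
  print("selected columns:", groups[0])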

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths (APSP) using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; we need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
  – Dependencies are ok, 2.5D works, just with a different semiring
• Kleene's Algorithm (a runnable version follows below):

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 * D12
    D21 = D21 * D11
    D22 = D21 * D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 * D21
    D12 = D12 * D22
    D11 = D12 * D21
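The same recursion in runnable NumPy (a sketch: "*" is the accumulating min-plus product defined above; assumes n is a power of 2 and a zero diagonal), checked against the triple loop:

  import numpy as np

  def mp(C, A, B):
      # the semiring "matmul" of the text: C(i,j) = min(C(i,j), min_k A(i,k) + B(k,j))
      return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

  def dc_apsp(D):
      n = D.shape[0]
      if n == 1:
          return D
      h = n // 2
      D = D.copy()
      D[:h, :h] = dc_apsp(D[:h, :h])
      D[:h, h:] = mp(D[:h, h:], D[:h, :h], D[:h, h:])   # D12 = D11 * D12
      D[h:, :h] = mp(D[h:, :h], D[h:, :h], D[:h, :h])   # D21 = D21 * D11
      D[h:, h:] = mp(D[h:, h:], D[h:, :h], D[:h, h:])   # D22 = D21 * D12
      D[h:, h:] = dc_apsp(D[h:, h:])
      D[h:, :h] = mp(D[h:, :h], D[h:, h:], D[h:, :h])   # D21 = D22 * D21
      D[:h, h:] = mp(D[:h, h:], D[:h, h:], D[h:, h:])   # D12 = D12 * D22
      D[:h, :h] = mp(D[:h, :h], D[:h, h:], D[h:, :h])   # D11 = D12 * D21
      return D

  # sanity check against Floyd-Warshall on a random 8-node graph
  rng = np.random.default_rng(0)
  n = 8
  D0 = rng.uniform(1, 10, (n, n))
  np.fill_diagonal(D0, 0.0)
  F = D0.copy()
  for k in range(n):
      F = np.minimum(F, np.add.outer(F[:, k], F[k, :]))
  assert np.allclose(dc_apsp(D0), F)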

Performance of 2.5D APSP using Kleene's algorithm
[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains these bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable, and a new one holds
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω(min( d·n / P^(1/2), d^2·n / P ))
  – The proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω(d^2·n / (P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, with Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, with U orthogonal, Λ diagonal
  – The columns of Q·U are the eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B, with B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
[Sequence of figures: numbered sweeps 1-6 of bulge chasing down the band, with orthogonal updates Q1, Q1^T, …, Q5, Q5^T applied to blocks of a band of width b+1]
• b = bandwidth
• c = #columns
• d = #diagonals
• Constraint: c + d ≤ b

Conventional vs CA-SBR
[Animations: conventional successive band reduction touches all data 4 times; the communication-avoiding version touches all data once]

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

  Algorithm          Two Levels (#Words; #Messages)                                Memory Hierarchy (#Words; #Messages)
  BLAS-3             [FLPR'99][BDLST'13][MKL etc]                                  [FLPR'99][BDLST'13][MKL etc]
  Cholesky           [G'97][AP'00][LAPACK][BDHS'09]; [G'97][AP'00][BDHS'09]        [G'97][AP'00][BDHS'09]
  Sym Indefinite     [BBDDDPSTY'13]                                                [BBDDDPSTY'13]
  LU                 [G'97][T'97][GDX'11][BDLST'13]; [GDX'11][BDLST'13]            [G'97][T'97][BDLST'13]; [BDLST'13]
  QR                 [EG'98][FW'03][DGHL'12][BDLST'13]; [FW'03][DGHL'12][BDLST'13] [EG'98][FW'03][BDLST'13]; [FW'03][BDLST'13]
  Rank Revealing QR  [BDD'11][DGGX'13]
  Sym Eig & SVD      [BDD'11][BDK'13]                                              [BDD'11]
  Non Sym Eig        [BDD'11]                                                      [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n^2 / P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

  Algorithm          #Words (BW); #Messages (L)                          Saving factor
  BLAS-3             [AGZ'94][MT'99][ScaLAPACK]; [C'69][vGW'97][SD'11]   L: n/P^(1/2)
  Cholesky           [ScaLAPACK]; [T'99][SD'11]                          L: n/P^(1/2)
  Sym Indefinite     [BBDDDPSTY'13][ScaLAPACK]; [BBDDDPSTY'13]           L: n/P^(1/2)
  LU                 [ScaLAPACK][GDX'11][T'99][SD'11]; [GDX'11][T'99][SD'11]   L: n/P^(1/2)
  QR                 [ScaLAPACK][DGHL'12][T'99]; [DGHL'12][T'99]         L: n/P^(1/2)
  Rank Revealing QR  [BDD'11][DGGX'13]
  Sym Eig & SVD      [BDD'11][BDK'13][ScaLAPACK]; [BDD'11][BDK'13]       L: n/P^(1/2)
  Non-Sym Eig        [BDD'11]; [BDD'11]                                  BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = O(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data – optimal
  – Parallel implementation on p processors: conventional O(k log p) messages (k SpMV calls, dot products); new O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
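The enabling ingredient is the matrix powers kernel. Below is a minimal sketch (illustrative code, not from the slides) for the simplest case, a 1D tridiagonal stencil A = tridiag(-1, 2, -1), assuming an interior processor that has fetched k ghost values from each neighbor in a single message exchange:

    import numpy as np

    def local_matrix_powers(x_local, left_ghosts, right_ghosts, k):
        # Local pieces of [A x, A^2 x, ..., A^k x] for A = tridiag(-1, 2, -1).
        # Entry i of A^j x depends only on entries i-j..i+j of x, so after one
        # exchange of k ghost values per side, no further communication is needed.
        n = len(x_local)
        v = np.concatenate([left_ghosts, x_local, right_ghosts])
        powers = []
        for j in range(1, k + 1):
            v = 2.0 * v[1:-1] - v[:-2] - v[2:]          # one application of A
            powers.append(v[k - j : k - j + n].copy())  # locally owned entries
        return powers

    # Check against the dense operator on a 12-point line; we "own" indices 4..7:
    N, k = 12, 3
    A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    x = np.random.rand(N)
    out = local_matrix_powers(x[4:8], x[4 - k:4], x[8:8 + k], k)
    for j in range(1, k + 1):
        assert np.allclose(out[j - 1], (np.linalg.matrix_power(A, j) @ x)[4:8])

For a general sparse A the same idea fetches the k-level overlap of the partition up front, at the price of redundant flops on the overlap (the "well-partitioned" assumption keeps that overlap small).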


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[figure: register-blocking performance profile; reference implementation vs best block size 4x2, performance in Mflops]

Register Profile: Itanium 2

[figure: register-blocking profile; performance ranges from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[figure: four register-profile panels – Power3 (17% of peak; 122 to 252 Mflops), Power4 (16%; 459 to 820 Mflops), Itanium 1 (8%; 107 to 247 Mflops), Itanium 2 (33%; 190 Mflops to 1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill-in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher
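The fill ratio is cheap to compute ahead of time; a small sketch (assumed helper, not part of any tuning library) that counts the r x c blocks a sparsity pattern touches:

    import numpy as np
    from scipy.sparse import random as sparse_random

    def fill_ratio(A, r, c):
        # Stored entries after r x c blocking (blocks padded with explicit
        # zeros) divided by true nnz; 1.0 means no wasted flops.
        coo = A.tocoo()
        blocks = set(zip(coo.row // r, coo.col // c))  # distinct blocks touched
        return len(blocks) * r * c / coo.nnz

    A = sparse_random(2000, 2000, density=0.005, format='csr', random_state=0)
    for r, c in [(1, 1), (2, 2), (3, 3), (4, 2)]:
        print((r, c), round(fill_ratio(A, r, c), 2))

On a purely random pattern the ratio approaches r·c, which is exactly why blocking only pays on matrices with natural dense substructure, like raefsky's 8x8 blocks.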


Source: Accelerator Cavity Design Problem (Ko, via Husbands)

[figure: nonzero structure plot]

100x100 Submatrix Along Diagonal

[figure: nonzero structure plot]

Post-RCM Reordering

[figure: nonzero structure plot after RCM reordering]

Effect of Combined RCM+TSP Reordering

[figure: nonzero structure before (green + red) and after (green + blue)]

• 2x speedups on Pentium 4, Power 4, …
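SciPy exposes the RCM half of this pipeline; a sketch (illustrative, with an assumed random test matrix) showing how the reordering pulls nonzeros toward the diagonal:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def bandwidth(M):
        coo = M.tocoo()
        return int(np.abs(coo.row - coo.col).max())

    A = sparse_random(2000, 2000, density=0.002, format='csr', random_state=0)
    A = ((A + A.T) > 0).astype(float).tocsr()     # symmetric sparsity pattern
    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    B = A[perm][:, perm]                          # symmetric permutation P A P^T
    print(bandwidth(A), '->', bandwidth(B))       # bandwidth typically shrinks a lot

The TSP-based step that then creates dense blocks on top of the banded structure is a separate, more specialized reordering.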

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Example: Classical Conjugate Gradient (CG)

[algorithm listing not preserved; callout: SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[algorithm listing not preserved; callouts: the s SpMVs are performed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
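For reference, here is classical CG with its per-iteration communication points marked (a sketch; the slides' own listings did not survive the transcript):

    import numpy as np

    def cg(A_mult, b, x, tol=1e-8, maxiter=1000):
        # Classical CG: each iteration costs one SpMV (neighbor
        # communication) plus two dot products (global reductions).
        r = b - A_mult(x)                # SpMV
        p = r.copy()
        rs = r @ r                       # global reduction
        for _ in range(maxiter):
            Ap = A_mult(p)               # SpMV: neighbor communication
            alpha = rs / (p @ Ap)        # global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r               # global reduction
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    n = 100
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x = cg(lambda v: T @ v, np.ones(n), np.zeros(n))

CA-CG restructures this loop so that s iterations are performed per outer step: one matrix powers kernel call replaces the s SpMVs, and one reduction computing the Gram matrix G replaces the 2s dot products.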

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


[figure: convergence of CA-CG (monomial basis) vs CG, down to machine precision – slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down]

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400
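The breakdown is easy to reproduce with the same model problem (an illustrative numpy sketch): the normalized monomial basis vectors all drift toward the dominant eigenvector, so the basis condition number explodes with s.

    import numpy as np

    n = 30                                         # 30x30 grid, 5-point stencil
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))
    rng = np.random.default_rng(0)
    v = rng.random(n * n)
    V = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))            # monomial basis [x, Ax, A^2 x, ...]
        print(s, np.linalg.cond(np.column_stack(V)))  # grows until rank deficient

One standard fix is a better-conditioned polynomial basis (Newton or Chebyshev polynomials of A), which is part of how s-step Krylov methods are stabilized in practice.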


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices

Taxonomy (nonzero entries × indices):
  – Entries explicit (O(nnz)), indices explicit (O(nnz)): CSR and variations
  – Entries explicit, indices implicit (o(nnz)): vision, climate, AMR, …
  – Entries implicit (o(nnz)), indices explicit: graph Laplacian
  – Entries implicit, indices implicit: stencils
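As an assumed illustration of the S + UDVᵀ point: the "matrix" is then a data structure plus a multiply routine, and applying it never forms an n x n array.

    import numpy as np
    from scipy.sparse import random as sparse_random

    n, r = 5000, 8
    S = sparse_random(n, n, density=1e-3, format='csr', random_state=0)
    U = np.random.rand(n, r)
    D = np.diag(np.random.rand(r))
    V = np.random.rand(n, r)

    def A_mult(x):
        # y = (S + U D V^T) x in O(nnz(S) + n*r) work
        return S @ x + U @ (D @ (V.T @ x))

    y = A_mult(np.ones(n))

Powers Aᵏx stay cheap for the same reason, provided the low-rank pieces are accumulated separately rather than multiplied out.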

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible)]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
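A toy single-bin version of the prerounding idea (my sketch; the real algorithm uses several bins to recover more digits): choose one power-of-two boundary from max|xᵢ| and n, round every summand to a multiple of it, and all subsequent additions are exact, hence order-independent.

    import math
    import numpy as np

    def reproducible_sum(x):
        # Bitwise identical result for any summation order; accuracy is
        # limited to the single "bin" chosen from max|x_i| and n.
        n = len(x)
        M = float(np.max(np.abs(x)))
        if M == 0.0:
            return 0.0
        e = math.frexp(M)[1]                       # M < 2**e
        ulp = 2.0 ** (e - 52 + math.ceil(math.log2(n)))
        q = np.round(x / ulp) * ulp                # exact: ulp is a power of two
        return float(np.sum(q))                    # every addition is exact

    x = np.random.randn(10**6)
    assert reproducible_sum(x) == reproducible_sum(x[::-1])

Each |qᵢ/ulp| is an integer below 2^(52 - log2 n), so even the largest partial sum fits in the 53-bit significand and no addition ever rounds.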


Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
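The experiment translates directly to Python (a sketch, assuming SciPy's A = P·L·U convention for lu):

    import numpy as np
    from scipy.linalg import lu

    rng = np.random.default_rng(0)
    for trial in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, U = lu(A)                      # pivoted: A = P L U
        W = (P.T @ A).copy()                 # unpivoted LU of P A (Doolittle)
        Lnp = np.eye(6)
        with np.errstate(divide='ignore', invalid='ignore'):
            for k in range(5):
                m = W[k+1:, k] / W[k, k]     # near-zero pivots -> huge multipliers
                Lnp[k+1:, k] = m
                W[k+1:, k:] -= np.outer(m, W[k, k:])
        print(trial, np.linalg.norm(L - Lnp))  # a few 0's, inf's, NaN's; rest O(1)

The spread of values is a rounding-order effect: pivots that would be exact zeros after eliminating the rank-3 part become arbitrary tiny numbers in floating point.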

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
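The safeguard is a few lines of control flow; in this sketch tslu, tsqr, and both thresholds are hypothetical stand-ins:

    import numpy as np

    def safe_tslu(A, tslu, tsqr, cond_max=1e10, L_max=1e2):
        # tslu(A) -> (P, L, U) with A = P L U; tsqr(A) -> (Q, R).
        P, L, U = tslu(A)                    # fast path
        if np.linalg.cond(U) < cond_max and np.abs(L).max() < L_max:
            return P, L, U                   # usual case: accept
        Q, R = tsqr(A)                       # rare case: A = Q R
        P, L, U = tslu(Q)                    # Q = P L U, safe since Q is orthogonal
        return P, L, U @ R                   # A = P L (U R), U R upper triangular

Since U and R are both upper triangular, so is U·R, and the fallback costs only one extra TSQR plus a TSLU on a perfectly conditioned matrix.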


2D CALU with Tournament Pivoting

[figure not preserved]

2.5D CALU with Tournament Pivoting (c=4 copies)

[figure not preserved]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[figure: heat map of predicted speedup over log2(p) (horizontal) and log2(n^2/p) = log2(memory_per_proc) (vertical); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

[figure not preserved]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: PAPᵀ = LDLᵀ, D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAPᵀ = LTLᵀ where T is banded, using TSLU
        [banded-matrix figure not preserved]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper Award at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" or SMLU

Recursive GEPP (columnwise layout throughout):

    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)

  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

SMLU (switches layouts):

    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
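A dense numpy sketch of the same recursion over the (min,+) semiring (illustrative; the 2.5D version distributes the min-plus multiplies):

    import numpy as np

    def minplus(X, Y):
        # (min,+) "matmul": Z[i,j] = min_k X[i,k] + Y[k,j]
        return np.min(X[:, :, None] + Y[None, :, :], axis=1)

    def dc_apsp(D):
        # D[i,j] = edge weight (np.inf if absent), D[i,i] = 0.
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        A, B = D[:m, :m], D[:m, m:]
        C, E = D[m:, :m], D[m:, m:]
        A = dc_apsp(A)                       # close the top-left block
        B = minplus(A, B)
        C = minplus(C, A)
        E = np.minimum(E, minplus(C, B))
        E = dc_apsp(E)                       # close the bottom-right block
        C = minplus(E, C)
        B = minplus(B, E)
        A = np.minimum(A, minplus(B, C))
        return np.block([[A, B], [C, E]])

    D = np.array([[0, 3, np.inf],
                  [np.inf, 0, 1],
                  [2, np.inf, 0]], dtype=float)
    print(dc_apsp(D))                        # all-pairs shortest path distances

Because D(i,i) = 0, the product A ⊗ B already dominates the old value of the updated block, so only the "corner" updates need an explicit minimum.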

Performance of 2.5D APSP using Kleene

[figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); callouts: 6.2x speedup and 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar):
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[figure sequence: successive band reduction with b = bandwidth, c = #columns, d = #diagonals, constraint c+d ≤ b; each step applies orthogonal transforms Q1, Q1ᵀ, Q2, Q2ᵀ, … to eliminate c columns and chase the resulting bulge (regions labeled 1, 2, 3, 4, …) down the band]

                                                                            Q2T

                                                                            Q3

                                                                            Q3T

                                                                            b+1

                                                                            b+1

                                                                            d+1

                                                                            d+1

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            c

                                                                            c

                                                                            b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                            Successive Band Reduction (BischofLangSun)

                                                                            1

                                                                            1

                                                                            2

                                                                            2

                                                                            3

                                                                            3

                                                                            4

                                                                            4

                                                                            5

                                                                            5

                                                                            Q1

                                                                            Q1T

                                                                            Q2

                                                                            Q2T

                                                                            Q3

                                                                            Q3T

                                                                            Q4

                                                                            Q4T

                                                                            b+1

                                                                            b+1

                                                                            d+1

                                                                            d+1

                                                                            c

                                                                            c

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                            Successive Band Reduction (BischofLangSun)

                                                                            1

                                                                            1

                                                                            2

                                                                            2

                                                                            3

                                                                            3

                                                                            4

                                                                            4

                                                                            5

                                                                            5

                                                                            Q5T

                                                                            Q1

                                                                            Q1T

                                                                            Q2

                                                                            Q2T

                                                                            Q3

                                                                            Q3T

                                                                            Q5

                                                                            Q4

                                                                            Q4T

                                                                            b+1

                                                                            b+1

                                                                            d+1

                                                                            d+1

                                                                            c

                                                                            c

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                            Successive Band Reduction (BischofLangSun)

                                                                            1

                                                                            1

                                                                            2

                                                                            2

                                                                            3

                                                                            3

                                                                            4

                                                                            4

                                                                            5

                                                                            5

                                                                            6

                                                                            6

                                                                            Q5T

                                                                            Q1

                                                                            Q1T

                                                                            Q2

                                                                            Q2T

                                                                            Q3

                                                                            Q3T

                                                                            Q5

                                                                            Q4

                                                                            Q4T

                                                                            b+1

                                                                            b+1

                                                                            d+1

                                                                            d+1

                                                                            c

                                                                            c

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            d+c

                                                                            b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                            Successive Band Reduction (BischofLangSun)

Conventional vs. CA-SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

[Embedded animations comparing the two bulge-chasing schedules omitted.]

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown vs. MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere vs. MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest vs. ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours vs. ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs. MKL: 1.9x
  – Best sequential speedup vs. ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will then be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized rank-revealing QR decomposition
    2. Randomized choice of where to try splitting the spectrum
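As an illustration of the divide step, here is a small scipy sketch; the ordered Schur form is used as a stand-in for the randomized rank-revealing-QR splitting of the actual CA algorithm, so only the block structure, not the communication pattern, is demonstrated:

```python
import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))

# Order the real Schur form so the k eigenvalues in the left half-plane
# come first: the leading k columns of Q then span an invariant subspace.
T, Q, k = la.schur(A, output='real', sort='lhp')

B = Q.T @ A @ Q                    # block upper triangular up to roundoff
assert np.allclose(B[k:, :k], 0, atol=1e-10)   # the slide's "ε" block

A11, A22 = B[:k, :k], B[k:, k:]    # recurse on the diagonal blocks
```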

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

| Algorithm | Two levels: #words | Two levels: #messages | Hierarchy: #words | Hierarchy: #messages |
|---|---|---|---|---|
| BLAS-3 | [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.] |
| Cholesky | [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09] |
| Sym. Indefinite | [BBDDDPSTY'13] | [BBDDDPSTY'13] | | |
| LU | [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13] |
| QR | [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13] |
| Rank-Revealing QR | [BDD'11][DGGX'13] | | | |
| Sym. Eig & SVD | [BDD'11][BDK'13] | [BDD'11] | | |
| Non-Sym. Eig | [BDD'11] | [BDD'11] | | |

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

| Algorithm | #words (BW) | #messages (L) | Saving factor |
|---|---|---|---|
| BLAS-3 | [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2) |
| Cholesky | [ScaLAPACK] | [T'99][SD'11] | L: n/P^(1/2) |
| Sym. Indefinite | [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2) |
| LU | [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2) |
| QR | [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2) |
| Rank-Revealing QR | [BDD'11][DGGX'13] | | |
| Sym. Eig & SVD | [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2) |
| Non-Sym. Eig | [BDD'11] | [BDD'11] | BW: P^(1/2), L: n |

Saving factors are attained with extra memory: 2.5D, M = Ω(c·n²/P).

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov subspace methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
(A sketch of the conventional kernel being reorganized follows.)

                                                                            75
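For reference, here is a minimal scipy sketch of the conventional computation that the CA matrix powers kernel reorganizes; the CA version itself (computing all k vectors per cache block or subdomain with ghost-zone redundancy, reading A once) is not shown:

```python
import numpy as np
import scipy.sparse as sp

def krylov_basis(A, x, k):
    """Conventional approach: k separate SpMV sweeps, i.e., k reads of A
    from slow memory (serial) or O(k log p) messages (parallel). A CA
    matrix powers kernel returns the same k+1 vectors with O(1) reads
    of A, resp. O(log p) messages."""
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

# Example: tridiagonal (1D Laplacian) matrix, k = 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100), format='csr')
V = krylov_basis(A, np.ones(100), 4)
```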

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)
[Matrix spy plot omitted.]

77

Example: The Difficulty of Tuning

• n = 21,200; nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs
[Zoomed spy plot omitted.]

                                                                            78
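A small scipy sketch of the idea (illustrative only; the tuned kernels unroll the dense blocks in C): storing the matrix in 8x8 block format keeps one column index per block rather than one per entry.

```python
import numpy as np
import scipy.sparse as sp

# Toy matrix built from dense 8x8 blocks (a stand-in for raefsky's
# 8x8 substructure).
b = 8
blocks = [np.random.rand(b, b) for _ in range(16)]
A_csr = sp.block_diag(blocks, format='csr')

A_bsr = A_csr.tobsr(blocksize=(b, b))  # one column index per 8x8 block
print(A_csr.indices.size)              # 1024 indices: one per nonzero
print(A_bsr.indices.size)              # 16 block indices: 64x less index traffic
```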

Speedups on Itanium 2: The Need for Search

[Figure: SpMV performance (Mflop/s) across register block sizes; the reference implementation and the best block size (4x2) are marked.]

79

Register Profile: Itanium 2

[Figure: register-blocking profile; performance ranges from 190 Mflop/s (worst) to 1190 Mflop/s (best).]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register profiles – Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33) – with performance ranging from 122 to 252 Mflop/s, 459 to 820 Mflop/s, 107 to 247 Mflop/s, and 190 Mflop/s to 1.2 Gflop/s, respectively. The best block size differs on every platform.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1M
[Matrix spy plot omitted.]

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1M
[Zoomed spy plot omitted.]

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5 × 1.5 = 2.25x higher, since the blocked code also executes 1.5x more flops
(A sketch of the fill-ratio estimate follows.)

                                                                            85
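The trade-off can be estimated before committing to a block size. A sketch in the spirit of the SPARSITY/OSKI heuristic (the function and threshold reasoning here are ours, not a library API):

```python
import scipy.sparse as sp

def fill_ratio(A, r, c):
    """Stored entries after r x c blocking (explicit zeros included),
    divided by the true nnz. Matrix dimensions must be divisible by
    (r, c), or tobsr() raises."""
    B = A.tocsr().tobsr(blocksize=(r, c))
    return B.nnz / A.nnz            # BSR nnz counts all stored scalars

# Heuristic: if dense r x c blocks run at m(r, c) Mflop/s on this
# machine, predict roughly m(r, c) / fill_ratio(A, r, c) on the real
# matrix. A fill ratio of 1.5 pays off whenever the blocked kernel is
# more than 1.5x faster per flop.
```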

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Matrix spy plot omitted.]

86

100x100 Submatrix Along Diagonal
[Spy plot omitted.]

87

Post-RCM Reordering
[Spy plot omitted.]

88

Effect of Combined RCM+TSP Reordering
Before: Green + Red. After: Green + Blue.
[Spy plots omitted.]

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                            91
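A conceptual sketch of what run-time tuning does (this is not the OSKI API, just the search idea): time SpMV on the user's matrix for several candidate block sizes and keep the winner; off-line tuning amortizes the per-machine benchmarking.

```python
import time
import numpy as np
import scipy.sparse as sp

def tune_spmv(A_csr, candidates=((1, 1), (2, 2), (4, 2), (4, 4), (8, 8)),
              trials=20):
    """Return the matrix converted to the block size with the fastest
    measured SpMV on this machine."""
    x = np.random.rand(A_csr.shape[1])
    best, best_t = A_csr, float('inf')
    for r, c in candidates:
        if A_csr.shape[0] % r or A_csr.shape[1] % c:
            continue                        # blocking must tile the matrix
        B = A_csr.tobsr(blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            B @ x
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = B, t
    return best
```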

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                            93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing omitted in this transcript.] The SpMVs and dot products require communication in each iteration; a minimal reference implementation follows.
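A minimal numpy rendering of classical CG (textbook form, standing in for the slide's own listing). Each iteration performs one SpMV and two inner products: one neighbor exchange plus two global reductions per step on a distributed machine.

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, x, tol=1e-8, maxiter=500):
    r = b - A @ x                 # SpMV: neighbor communication
    p = r.copy()
    rho = r @ r                   # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                # SpMV (1 per iteration)
        alpha = rho / (p @ Ap)    # dot product (global reduction)
        x = x + alpha * p
        r = r - alpha * Ap
        rho_new = r @ r           # dot product (global reduction)
        if rho_new ** 0.5 <= tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x

# Example: SPD tridiagonal system
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100), format='csr')
x = cg(A, np.ones(100), np.zeros(100))
```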

94

Example: CA-Conjugate Gradient

[Algorithm listing omitted in this transcript.] The s SpMVs are computed via the CA matrix powers kernel, and the dot products become a single global reduction to compute the Gram matrix G; the local computations within the inner loop then require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                            96

[Figure: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG converges more slowly due to roundoff and loses accuracy relative to machine precision; at s = 16 the monomial basis is numerically rank deficient and the method breaks down.]

                                                                            97
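The breakdown is easy to reproduce. A numpy sketch, with the model problem rebuilt from the slide's description (30x30 grid, 5-point 2D Poisson): the columns of the monomial basis [x, Ax, …, Aˢx] align with the dominant eigenvector, so the basis condition number grows rapidly with s.

```python
import numpy as np
import scipy.sparse as sp

m = 30                                          # 30x30 grid, cond(A) ~ 400
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()  # 2D Poisson, 5-point

rng = np.random.default_rng(0)
v = rng.standard_normal(m * m)
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))             # normalize to avoid overflow
    print(s, np.linalg.cond(np.column_stack(V)))
# The condition number climbs rapidly with s; the slide reports numerical
# rank deficiency at s = 16. Newton or Chebyshev bases stay well conditioned.
```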

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

|                            | Indices explicit (O(nnz)) | Indices implicit (o(nnz)) |
|----------------------------|---------------------------|---------------------------|
| Entries explicit (O(nnz))  | CSR and variations        | Vision, climate, AMR, …   |
| Entries implicit (o(nnz))  | Graph Laplacian           | Stencils                  |

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, with D small & square
  – Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices
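The payoff of keeping A = S + UDVᵀ in factored form is that kernels never touch a dense matrix. A minimal sketch of the matvec (the shapes are illustrative assumptions):

```python
import numpy as np
import scipy.sparse as sp

def matvec(S, U, D, V, x):
    """y = (S + U D V^T) x without forming the dense sum:
    O(nnz(S) + n*k) work, where k is the rank of the correction."""
    return S @ x + U @ (D @ (V.T @ x))

# Illustrative shapes: S sparse n x n; U, V dense n x k; D dense k x k
n, k = 1000, 5
S = sp.random(n, n, density=0.01, format='csr')
U, V = np.random.rand(n, k), np.random.rand(n, k)
D = np.diag(np.random.rand(k))
y = matvec(S, U, D, V, np.random.rand(n))
```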

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                            101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value. For orthogonal vectors, even the sign is not reproducible.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

                                                                            104
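Why this is hard in the first place: floating-point addition is not associative, so any change in reduction order (thread count, blocking, layout) can change the computed bits. A two-part demonstration:

```python
import random

# Floating-point addition is not associative:
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)        # 1.0
print(a + (b + c))        # 0.0  (the 1.0 is absorbed into -1e16 first)

# The same effect at scale: a shuffled array rarely sums to the
# bit-identical value, which is what a different thread count does
# to a parallel reduction.
xs = [random.uniform(-1, 1) for _ in range(10**6)]
s1 = sum(xs)
random.shuffle(xs)
print(s1 == sum(xs))      # usually False
```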

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); then do LU without pivoting on P·A; compare L factors: are they the same?
• Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
• Rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
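The slide's Matlab experiment translates directly; here is a rough Python reconstruction (scipy's lu returns A = P·L·U, so "LU without pivoting on P·A" becomes unpivoted LU of Pᵀ·A). On rank-3 inputs the unpivoted elimination divides by near-zero pivots, producing exactly the 0's, ∞'s, and NaNs reported above.

```python
import numpy as np
import scipy.linalg as la

def lu_nopivot(A):
    """Unpivoted (Doolittle) LU; breaks down on (near-)zero pivots."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                       # may divide by ~0
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

rng = np.random.default_rng(0)
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
    P, L, U = la.lu(A)                    # A = P @ L @ U, partial pivoting
    with np.errstate(divide='ignore', invalid='ignore'):
        Lnp, _ = lu_nopivot(P.T @ A)      # same row order, no pivoting
    print(np.max(np.abs(L - Lnp)))        # 0's, inf's, NaN's, and O(1) values
```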

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
(A control-flow sketch of the fallback follows.)

                                                                              42
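As a control-flow sketch of that fallback chain (the tslu and tsqr panel routines are caller-supplied placeholders, and the stability thresholds are illustrative assumptions, not a specified API):

```python
import numpy as np

def stable_tslu(A, tslu, tsqr, kappa_max=1e8, L_max=1e2):
    """Guarded TSLU (sketch). Returns P, L, U with A = P L U."""
    P, L, U = tslu(A)
    u_ok = np.min(np.abs(np.diag(U))) > 1.0 / kappa_max  # U not near-singular
    l_ok = np.max(np.abs(L)) < L_max                     # no huge L entries
    if u_ok and l_ok:
        return P, L, U            # usual case: cheap tests pass
    Q, R = tsqr(A)                # rare case: A = Q R via TSQR
    P, L, U = tslu(Q)             # TSLU on orthogonal Q is safe
    return P, L, U @ R            # A = P L (U R); U R is upper triangular
```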

                                                                              2D CALU with Tournament Pivoting

                                                                              43

2.5D CALU with Tournament Pivoting (c = 4 copies)

                                                                              44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination

                                                                              2D CA-LU vs ScaLAPACK-LU

[Contour plot of predicted speedup: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite:
  – Seek a factorization that retains symmetry: PAPᵀ = LDLᵀ, D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAPᵀ = LTLᵀ where T is banded, using TSLU

48

[Diagram: the banded matrix T]

      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP:
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• Conventional (column layout; a runnable sketch of the recursion follows below):

      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          update right half of A
          factor(right half of A)

  – #Words = O(n³/M^(1/2))
  – #Messages = O(n³/M)

• Shape Morphing LU:

      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          reshape to recursive block format
          update right half of A
          reshape to columnwise format
          factor(right half of A)

  – #Words = O(n³/M^(1/2))
  – #Messages = O(n³/M^(3/2))
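For concreteness, a minimal NumPy sketch of the recursive skeleton above, with no pivoting and no layout reshaping (so it shows only the recursion, not SMLU itself):

    import numpy as np

    def recursive_lu(A):
        """In-place recursive LU of A (m x n, m >= n), no pivoting.

        Follows the slide's recursion: factor left half, update right
        half, factor right half. A is overwritten with L (unit lower) and U.
        """
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]                      # "update it": scale the column
            return A
        k = n // 2
        recursive_lu(A[:, :k])                       # factor(left half of A)
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])  # update right half: U12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]           # Schur complement
        recursive_lu(A[k:, k:])                      # factor(right half of A)
        return A

    # quick check on a diagonally dominant matrix (safe without pivoting)
    A = np.random.rand(8, 8) + 8 * np.eye(8)
    LU = recursive_lu(A.copy())
    L = np.tril(LU, -1) + np.eye(8); U = np.triu(LU)
    assert np.allclose(L @ U, A)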

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR:
  – Choose a permutation P so that the leading columns of AP = QR span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (see the sketch below):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix.
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50
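One round of such a tournament can be sketched with library QR-with-column-pivoting as the local selector (a minimal illustration of the tournament structure, not the Gu/Eisenstat variant):

    import numpy as np
    from scipy.linalg import qr

    def tournament_round(A, group1, group2, b):
        """Select the 'best' b columns from two groups of candidate columns,
        using ordinary QR with column pivoting as the local selector."""
        cand = np.concatenate([group1, group2])
        _, _, piv = qr(A[:, cand], mode='economic', pivoting=True)
        return cand[piv[:b]]

    def tournament_pivot(A, b):
        """Pairwise reduction tree over groups of b columns until one group remains."""
        groups = [np.arange(i, min(i + b, A.shape[1])) for i in range(0, A.shape[1], b)]
        while len(groups) > 1:
            pairs = zip(groups[0::2], groups[1::2])
            groups = [tournament_round(A, g1, g2, b) for g1, g2 in pairs] + \
                     (groups[-1:] if len(groups) % 2 else [])
        return groups[0]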

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm.
• Ex: All-Pairs Shortest Paths using Floyd-Warshall.
• Similar to matmul: let D = A, then:

      for k = 1:n
        for i = 1:n
          for j = 1:n
            D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea.
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies OK; 2.5D works, just a different semiring.
• Kleene's Algorithm (a runnable transcription follows below):

      D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21

52
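A runnable NumPy transcription of this recursion, with dense min-plus products and the ⊗-updates accumulated into the destination block via elementwise min (illustrative: the broadcasted min-plus product uses O(n³) memory, fine only for small n):

    import numpy as np

    def minplus(A, B):
        """Min-plus (tropical) matrix product: C[i,j] = min_k A[i,k] + B[k,j]."""
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    def dc_apsp(D):
        """Divide-and-conquer APSP (Kleene) on distance matrix D.
        Assumes D[i,i] = 0 and np.inf for absent edges; modifies D in place."""
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12, D21, D22 = D[:m, :m], D[:m, m:], D[m:, :m], D[m:, m:]
        dc_apsp(D11)
        D[:m, m:] = np.minimum(D12, minplus(D11, D12))
        D[m:, :m] = np.minimum(D21, minplus(D21, D11))
        D[m:, m:] = np.minimum(D22, minplus(D[m:, :m], D[:m, m:]))
        dc_apsp(D[m:, m:])
        D[m:, :m] = np.minimum(D[m:, :m], minplus(D[m:, m:], D[m:, :m]))
        D[:m, m:] = np.minimum(D[:m, m:], minplus(D[:m, m:], D[m:, m:]))
        D[:m, :m] = np.minimum(D[:m, :m], minplus(D[:m, m:], D[m:, :m]))
        return D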

Performance of 2.5D APSP using Kleene

53

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those.
• Ex: Cholesky on a matrix A with good separators.
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w.
  – Need to do dense Cholesky on a w x w submatrix.
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, and similar matrices.
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid.
• Sequential multifrontal Cholesky attains the bounds.
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package.
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators).

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one.
• Ex: A, B both diagonal: no communication in the parallel case.
• Ex: A, B both Erdős–Rényi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on).
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdős–Rényi matmul satisfies (in expectation)

      #Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))

  – Proof exploits the fact that reuse of entries of C = A·B is unlikely.
• Contrast the general lower bound: #Words_moved = Ω(d²·n/(P·M^(1/2))).
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost.

55

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A=Aᵀ (SVD similar):
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → QAQᵀ = B, where B=Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of diagram steps: an orthogonal transformation Q1 annihilates c columns of the band (bandwidth b+1), creating a bulge of d+c diagonals; applying Q1ᵀ on the other side, then Q2/Q2ᵀ through Q5/Q5ᵀ, chases the bulge down the band in sweeps labeled 1–6. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm.
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (#Words, #Messages); Memory Hierarchy (#Words, #Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc] | [FLPR'99][BDLST'13][MKL etc]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Nonsym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words (BW), #Messages (L); saving factor when attaining with extra memory (2.5D, M = Θ(c·n²/P))

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Nonsym Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

(A sketch of the underlying matrix powers kernel follows below.)
                                                                              75

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Register-profile plot: reference implementation 190 Mflops; best block size 4x2, 1190 Mflops]

79

Register Profile: Itanium 2

[Plot: performance ranges from 190 Mflops to 1190 Mflops across register block sizes]

80

Register Profiles: IBM and Intel IA-64

[Four register-profile plots: Power3 – 17 (122–252 Mflops), Power4 – 16 (459–820 Mflops), Itanium 1 – 8 (107–247 Mflops), Itanium 2 – 33 (190 Mflops – 1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking (see the sketch below)
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher

85
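The tradeoff can be estimated directly: a small scipy sketch (illustrative, not the OSKI heuristic) that counts the fill ratio of an r x c register blocking:

    import scipy.sparse as sp

    def fill_ratio(A, r, c):
        """Stored values after r x c register blocking / true nonzeros.

        Blocking pads each occupied r x c cell with explicit zeros, so the
        blocked format stores fill_ratio * nnz values; blocking pays off only
        when the unrolled block kernel's Mflop gain outweighs this factor.
        """
        A = A.tocoo()
        blocks = set(zip(A.row // r, A.col // c))   # occupied r x c cells
        return len(blocks) * r * c / A.nnz

    # estimated speedup ≈ (block-kernel Mflop-rate ratio) / fill_ratio(A, r, c)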

Source: Accelerator cavity design problem (Ko, via Husbands)

                                                                              86

                                                                              100x100 Submatrix Along Diagonal

87

                                                                              Post-RCM Reordering

                                                                              88

                                                                              Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                              93

Example: Classical Conjugate Gradient (CG)

[Algorithm shown as image; the SpMVs and dot products require communication in each iteration]
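For reference, a minimal NumPy version of classical CG with the per-iteration communication points marked (standard textbook CG, not transcribed from the slide image):

    import numpy as np

    def cg(A, b, tol=1e-8, maxiter=500):
        """Classical conjugate gradients for symmetric positive definite A."""
        x = np.zeros_like(b)
        r = b.copy()                 # residual
        p = r.copy()                 # search direction
        rr = r @ r                   # dot product -> global reduction in parallel
        for _ in range(maxiter):
            Ap = A @ p               # SpMV -> nearest-neighbor communication
            alpha = rr / (p @ Ap)    # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r           # dot product -> global reduction
            if np.sqrt(rr_new) < tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x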

Example: CA-Conjugate Gradient

[Algorithm shown as image: the s SpMVs are computed via the CA matrix powers kernel; one global reduction computes the Gram matrix G; the local computations within the inner loop require no communication]

94

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                              96

[Convergence plot, CA-CG (monomial basis) vs CG: slower convergence due to roundoff, and loss of accuracy due to roundoff relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400]

97
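The breakdown is easy to reproduce: the monomial basis vectors A^j·x all align with the dominant eigenvector, so the basis condition number grows exponentially with s. A small NumPy demo on the same model problem (columns scaled to unit norm for readability):

    import numpy as np
    import scipy.sparse as sp

    # 2D Poisson, 5-point stencil, on an m x m grid (m = 30 in the slide)
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))

    x = np.random.rand(m * m)
    V = x[:, None] / np.linalg.norm(x)
    for s in range(1, 17):
        v = A @ V[:, -1]                                  # next monomial basis vector
        V = np.hstack([V, (v / np.linalg.norm(v))[:, None]])
        print(s, np.linalg.cond(V))   # grows exponentially; ~1/macheps near s = 16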

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store.
• Nonzero entries and indices could be explicit or implicit:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):  Graph Laplacian             Stencils

• Matrix could be a sum of "sparse" matrices:
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners.
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices.

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                              101

Reproducible Floating Point Computation

• Get the bit-wise identical answer when you type a.out again.
• NA-Digest submission on 8 Sep 2010:
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it:
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure, two panels: "Absolute Error for Random Vectors" – errors of the same magnitude but opposite signs; "Relative Error for Orthogonal Vectors" – even the sign is not reproducible]

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

                                                                              103
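The root cause is that floating-point addition is not associative, so the different reduction orders induced by different thread counts change the rounded result. A minimal illustration (plain Python, not MKL-specific):

    import random

    x = [random.uniform(-1, 1) for _ in range(10**6)]
    s1 = sum(x)
    random.shuffle(x)          # same summands, different order
    s2 = sum(x)
    print(s1 - s2)             # typically a small nonzero difference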

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
1. Same answer, independent of layout, #processors, order of summands
2. Good performance (scales well)
3. Portable (assume IEEE 754 only)
4. User can choose accuracy

• Approaches:
– Guarantee fixed reduction tree (fails goals 2 and 3)
– Use (very) high precision to get exact answer (fails goal 2)
– Prerounding technique (Nguyen, D.)

                                                                              104
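As a rough illustration of the prerounding idea, here is a minimal single-extraction sketch in Python, assuming only IEEE 754 doubles; the actual Nguyen/Demmel algorithm uses several extraction levels and careful error bounds, so treat this as a toy:

    import math, random

    def reproducible_sum(x):
        m = max(abs(v) for v in x)      # max is associative, so reproducible
        if m == 0.0:
            return 0.0
        # Power-of-two boundary > n*max|x_i|: after prerounding, every partial
        # sum is an exact multiple of ulp(boundary), so the additions below
        # commit no rounding error and any summation order gives the same bits.
        boundary = math.ldexp(1.0, math.frexp(m)[1] + math.ceil(math.log2(len(x))))
        # (v + boundary) - boundary rounds v to ulp(boundary) exactly; accuracy
        # is limited to the bits kept, the price of doing only one extraction.
        return sum((v + boundary) - boundary for v in x)

    x = [random.uniform(-1, 1) for _ in range(10**5)]
    s = reproducible_sum(x)
    random.shuffle(x)
    assert reproducible_sum(x) == s     # bit-wise identical under reordering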

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                                              Summary

Don't Communic…

                                                                              106

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers)


                                                                                Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)

• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

                                                                                42

                                                                                2D CALU with Tournament Pivoting

                                                                                43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                                                                44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup plotted over log2(p) (horizontal axis) vs log2(n^2/p) = log2(memory_per_proc) (vertical axis); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
– Seek factorization that retains symmetry: PAP^T = LDL^T, with D "simple"
• Save 1/2 the flops, preserve inertia
– Usual approach: Bunch-Kaufman
• D block diagonal with 1x1 and 2x2 blocks
• Pivot search down column, along row (lots of communication)
– Alternative: Aasen
• D = tridiagonal = T

[Figure: banded matrix T, zeros outside the band]

• Two steps:
– PAP^T = LTL^T where T is banded, using TSLU
– Solve/factor narrow band problem with T
• Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
– So far, could not do partial pivoting and minimize #messages, just #words
– Challenge:
• Column layout good for choosing pivots, bad for matmul
• Blocked layout good for matmul, bad for choosing pivots
– Solution: use both layouts, switching between them
• "Shape Morphing LU" or SMLU

Recursive LU, columnwise layout throughout:

  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)

  #Words = O(n^3 / M^(1/2))
  #Messages = O(n^3 / M)

Shape Morphing LU, switching layouts:

  func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)

  #Words = O(n^3 / M^(1/2))
  #Messages = O(n^3 / M^(3/2))

49

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
– Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
– Usual approach, like partial pivoting:
• Put longest column first, update rest of matrix, repeat
• Hard to do using BLAS3 at all, let alone hit lower bound
– Use Tournament Pivoting:
• Each round of tournament selects best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
• Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
– Idea extends to other pivoting schemes:
• Cholesky with diagonal pivoting
• LU with complete pivoting
• LDL^T with complete pivoting

50
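A small sketch of the tournament idea for column selection (Python, using SciPy's pivoted QR as the local selector; the group handling and names are illustrative assumptions, not the talk's code):

    import numpy as np
    from scipy.linalg import qr

    def tournament_select_columns(A, b):
        # Start with groups of b columns; each round merges pairs of groups
        # and keeps the b "best" columns of the pair, chosen by pivoted QR.
        n = A.shape[1]
        groups = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
        while len(groups) > 1:
            merged = []
            for g1, g2 in zip(groups[0::2], groups[1::2]):
                cols = g1 + g2
                _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
                merged.append([cols[p] for p in piv[:b]])
            if len(groups) % 2 == 1:
                merged.append(groups[-1])   # odd group gets a bye this round
            groups = merged
        return groups[0]   # b column indices approximately spanning col space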

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
• classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
– Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 * D12
    D21 = D21 * D11
    D22 = D21 * D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 * D21
    D12 = D12 * D22
    D11 = D12 * D21

52
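A runnable sketch of Kleene's algorithm over the (min,+) semiring (Python/NumPy; minplus and dc_apsp are my names, checked against the Floyd-Warshall loop above):

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): matmul with (min,+)
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12 = D[:h, :h], D[:h, h:]    # views, updated in place
        D21, D22 = D[h:, :h], D[h:, h:]
        dc_apsp(D11)
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    rng = np.random.default_rng(0)
    n = 8
    A = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), np.inf)
    np.fill_diagonal(A, 0.0)
    D = A.copy()
    for k in range(n):                     # Floyd-Warshall reference
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    assert np.allclose(dc_apsp(A.copy()), D)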

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations mark a 6.2x speedup and a 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan'79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
– Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George'73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
– w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
#Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
– Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
• classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

                                                                                Symmetric Eigenproblem and SVD

• Usual approach for A=A^T (SVD similar):
– A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
– T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
– Q·U's columns are eigenvectors, Λ holds the eigenvalues
– Dense → Tridiagonal → Diagonal
– Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-Avoiding Approach:
– A → Q·A·Q^T = B, where B=B^T banded, of bandwidth M^(1/2)
– Continue as above, starting with B
– Dense → Banded → Tridiagonal → Diagonal
– Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
– Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth
c = #columns
d = #diagonals
Constraint: c+d ≤ b

[Figure: animation over several slides of successive band reduction; orthogonal transformations Q1, Q1^T, …, Q5, Q5^T create and chase bulges 1–6 down the band, with block dimensions annotated b+1, d+1, c, and d+c]

Conventional vs CA - SBR

Conventional: touch all data 4 times
Communication-Avoiding: touch all data once

[Animations comparing conventional and CA bulge chasing]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
– n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
– n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
– n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
– n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
– Best sequential speedup vs MKL: 1.9x
– Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
– Find orthogonal matrix Q whose leading columns span an invariant subspace of A
– Q^T·A·Q will be block upper triangular:

    [ A11  A12 ]
    [  ε   A22 ]

– Apply recursively to A11, A22
– Depends on randomization:
1. Randomized Rank Revealing QR decomposition
2. Randomized location to try splitting spectrum
Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns of the original table: #Words and #Messages, for Two Levels of memory and for the full Memory Hierarchy

BLAS-3:             [FLPR'99][BDLST'13][MKL etc] (words and messages, both models)
Cholesky:           [G'97][AP'00][LAPACK][BDHS'09] (two-level words); [G'97][AP'00][BDHS'09] (remaining columns)
Sym Indefinite:     [BBDDDPSTY'13] (words and messages)
LU:                 [G'97][T'97][GDX'11][BDLST'13] (two-level words); [GDX'11][BDLST'13] (two-level messages); [G'97][T'97][BDLST'13] (hierarchy words); [BDLST'13] (hierarchy messages)
QR:                 [EG'98][FW'03][DGHL'12][BDLST'13] (two-level words); [FW'03][DGHL'12][BDLST'13] (two-level messages); [EG'98][FW'03][BDLST'13] (hierarchy words); [FW'03][BDLST'13] (hierarchy messages)
Rank Revealing QR:  [BDD'11][DGGX'13]
Sym Eig & SVD:      [BDD'11][BDK'13] (words); [BDD'11] (messages)
Non Sym Eig:        [BDD'11] (words and messages)

Attaining the Lower bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

BLAS-3:             [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; latency saving factor n/P^(1/2)
Cholesky:           [ScaLAPACK][T'99][SD'11]; latency saving factor n/P^(1/2)
Sym Indefinite:     [BBDDDPSTY'13][ScaLAPACK] (words); [BBDDDPSTY'13] (messages); latency saving factor n/P^(1/2)
LU:                 [ScaLAPACK][GDX'11][T'99][SD'11] (words); [GDX'11][T'99][SD'11] (messages); latency saving factor n/P^(1/2)
QR:                 [ScaLAPACK][DGHL'12][T'99] (words); [DGHL'12][T'99] (messages); latency saving factor n/P^(1/2)
Rank Revealing QR:  [BDD'11][DGGX'13]
Sym Eig & SVD:      [BDD'11][BDK'13][ScaLAPACK] (words); [BDD'11][BDK'13] (messages); latency saving factor n/P^(1/2)
Non-Sym Eig:        [BDD'11] (words and messages); saving factors: bandwidth P^(1/2), latency n

Attaining with extra memory: 2.5D, M = O(c·n^2/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
• classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

                                                                                Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication (see the matrix powers sketch below)
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                75
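To make the "one message instead of k" claim concrete, here is a minimal serial sketch (ours, not from the slides) of the matrix powers kernel idea for a 1D Laplacian: fetching k ghost values from each neighbor up front replaces k separate exchanges, at the price of some redundant edge computation. All names and sizes are illustrative.

    import numpy as np

    def matrix_powers_1d(x_with_ghosts, k):
        # This processor owns m values and has fetched k ghost values from
        # each neighbor in ONE exchange (instead of one exchange per SpMV).
        # Stencil (1D Laplacian): (A*x)[i] = 2*x[i] - x[i-1] - x[i+1].
        v = np.asarray(x_with_ghosts, dtype=float)
        basis = []
        for _ in range(k):
            v = 2 * v[1:-1] - v[:-2] - v[2:]  # local stencil sweep, no communication
            basis.append(v)                   # A^(j+1)*x on a window that shrinks by
        return basis                          # one ghost per side per step (that
                                              # shrinkage is the redundant work)

    owned = np.random.rand(100)
    ghosted = np.concatenate([np.zeros(3), owned, np.zeros(3)])  # k=3 ghosts per side
    print([len(b) for b in matrix_powers_1d(ghosted, 3)])        # [104, 102, 100]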

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

                                                                                77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                78

Speedups on Itanium 2: The Need for Search

[Figure: SpMV performance profiles on Itanium 2, in Mflops – reference implementation vs. the best register blocking found by search (4x2).]

                                                                                79

Register Profile: Itanium 2

[Figure: performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

                                                                                80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile panels, labeled as in the original – Power3 - 17, Power4 - 16, Itanium 2 - 33, Itanium 1 - 8 – with per-panel Mflops ranges: Power3 122-252 Mflops, Power4 459-820 Mflops, Itanium 1 107-247 Mflops, Itanium 2 190 Mflops-1.2 Gflops.]

                                                                                Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

                                                                                82

                                                                                Zoom in to top corner

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

                                                                                83

3x3 blocks look natural but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

                                                                                84

                                                                                Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (see the SciPy sketch below)

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5^2 = 2.25x higher

                                                                                85
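A quick way to see the fill-ratio tradeoff today is SciPy's BSR format, which stores explicit zeros to complete each r x c block. This is only a hedged illustration of the idea (the matrix and sizes are made up, not the slide's raefsky example):

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(3000, 3000, density=0.01, format="csr", random_state=0)
    B = A.tobsr(blocksize=(3, 3))   # fill explicit zeros so every 3x3 block is dense
    fill_ratio = B.nnz / A.nnz      # stored values (including fill) per true nonzero
    x = np.ones(3000)
    y = B @ x                       # SpMV now runs on unrolled dense 3x3 blocks
    print(fill_ratio)               # >1: extra flops traded for fewer index accesses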

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                86

                                                                                100x100 Submatrix Along Diagonal

87

                                                                                Post-RCM Reordering

                                                                                88

                                                                                Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

                                                                                Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                90

                                                                                Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning (a toy tuning loop is sketched below)
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                91
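OSKI's run-time tuning can be mimicked in a few lines: benchmark SpMV under several candidate block sizes and keep the fastest. This is only a toy stand-in for OSKI's model-driven search; the function name, trial set, and sizes are made up:

    import time
    import numpy as np
    import scipy.sparse as sp

    def tune_blocksize(A, x, trials=((1, 1), (2, 2), (3, 3), (4, 4), (8, 8)), reps=10):
        # Run-time tuning in miniature: measure each candidate, keep the winner.
        best, best_t = A, float("inf")
        for r, c in trials:
            if A.shape[0] % r or A.shape[1] % c:
                continue                      # BSR needs evenly dividing blocks
            B = A.tobsr(blocksize=(r, c))
            t0 = time.perf_counter()
            for _ in range(reps):
                B @ x
            dt = time.perf_counter() - t0
            if dt < best_t:
                best, best_t = B, dt
        return best

    A = sp.random(4096, 4096, density=0.005, format="csr", random_state=0)
    A_tuned = tune_blocksize(A, np.ones(4096))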

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
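As a reminder of where those costs live, here is a minimal serial, unpreconditioned CG with the communication points marked in comments; in a parallel run each dot product is a global reduction and each SpMV exchanges halo data with neighbors:

    import numpy as np

    def cg(A, b, x0, iters):
        x = x0.copy()
        r = b - A @ x                  # SpMV
        p = r.copy()
        rr = r @ r                     # dot product (global reduction)
        for _ in range(iters):
            Ap = A @ p                 # SpMV: neighbor communication
            alpha = rr / (p @ Ap)      # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r             # dot product: global reduction
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 100                            # small 1D Laplacian test problem
    A = np.diag(2 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    x = cg(A, np.ones(n), np.zeros(n), 50)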

94

Example: CA-Conjugate Gradient

[Algorithm figure: the Krylov basis is computed via the CA Matrix Powers Kernel; a single global reduction computes the Gram matrix G; local computations within the inner loop require no communication.]
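The role of G: once the s-step vectors are carried as coefficient vectors in a Krylov basis V, every inner product reduces to a small local computation against G = V^T V, so one reduction replaces the roughly 2s reductions of classical CG. A minimal sketch (sizes illustrative; in MPI the commented line would be an Allreduce):

    import numpy as np

    m, s = 1000, 5
    V_local = np.random.rand(m, 2 * s + 1)  # this processor's rows of the basis V
    G = V_local.T @ V_local                  # + one Allreduce of this small matrix

    a = np.random.rand(2 * s + 1)            # x = V a (coefficients, replicated)
    b = np.random.rand(2 * s + 1)            # y = V b
    dot_xy = a @ G @ b                        # x . y with no further communication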

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                96

[Figure: convergence of CA-CG (monomial basis) vs. CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down before reaching machine precision.]

                                                                                97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square

• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices (see the sketch below)
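For example, A = S + UDV^T can be applied (and powered) without ever forming a dense n x n matrix; a small sketch with made-up sizes:

    import numpy as np
    import scipy.sparse as sp

    def apply_A(S, U, D, V, x):
        # y = (S + U D V^T) x : one SpMV plus two tall-skinny products
        return S @ x + U @ (D @ (V.T @ x))

    n, r = 10000, 5
    S = sp.random(n, n, density=1e-4, format="csr", random_state=0)
    U, V = np.random.rand(n, r), np.random.rand(n, r)
    D = np.diag(np.random.rand(r))

    x = np.random.rand(n)
    y = apply_A(S, U, D, V, apply_A(S, U, D, V, x))  # A^2 x, still O(nnz + nr) work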

Nonzero entries \ Indices    Explicit (O(nnz))       Implicit (o(nnz))
Explicit (O(nnz))            CSR and variations      Vision, climate, AMR, …
Implicit (o(nnz))            Graph Laplacian         Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.)
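The non-associativity problem, and the "exact answer" approach, in a few lines of Python: a plain left-to-right sum changes with summand order, while math.fsum returns the correctly rounded exact sum and hence the same bits for any order. (Illustration only; the prerounding technique itself is not shown.)

    import math
    import random

    random.seed(0)
    xs = [random.uniform(-1, 1) * 10.0 ** random.randint(0, 15) for _ in range(100000)]
    ys = xs[:]
    random.shuffle(ys)

    print(sum(xs) == sum(ys))              # typically False: rounding depends on order
    print(math.fsum(xs) == math.fsum(ys))  # True: the exact sum is order-independent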

                                                                                104

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                                                Summary

Don't Communic…

                                                                                106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                                                                                  2D CALU with Tournament Pivoting

                                                                                  43

2.5D CALU with Tournament Pivoting (c=4 copies)

                                                                                  44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
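Plugging these parameters into the three-term running-time model shows why latency dominates at this scale. Every constant below that is not on the slide above (notably the per-flop time and the 8-byte word size) is an assumption for illustration, not a measured value:

    gamma = 1e-11        # assumed seconds per flop (NOT from the slide)
    beta = 8 / 100e9     # 8-byte word over 100 GB/sec interconnect (slide above)
    alpha = 1e-6         # 1 microsec interconnect latency (slide above)

    def model_time(flops, words, messages):
        return gamma * flops + beta * words + alpha * messages

    n, P = 2.0**20, 2.0**20
    flops = (2.0 / 3.0) * n**3 / P
    words = n**2 / P**0.5
    print(model_time(flops, words, n))         # conventional LU: ~n messages
    print(model_time(flops, words, P**0.5))    # CA-LU: ~P^(1/2) messages

With these (assumed) constants the latency term drops from about a second to about a millisecond, which is the source of the predicted speedups in the figure below.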

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup over log2(p) (horizontal) and log2(n^2/p) = log2(memory_per_proc) (vertical); up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: PAP^T = LDL^T, D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAP^T = LTL^T where T is banded, using TSLU
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48
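SciPy exposes the "usual approach" (a Bunch-Kaufman-style LDL^T with 1x1 and 2x2 blocks in D), which makes the symmetry-preserving factorization easy to inspect; a small sketch with a made-up indefinite matrix:

    import numpy as np
    from scipy.linalg import ldl

    A = np.array([[0., 1., 2.],
                  [1., 0., 3.],
                  [2., 3., 0.]])         # symmetric indefinite (zero diagonal)

    L, D, perm = ldl(A)                  # D: block diagonal, 1x1 and 2x2 blocks
    assert np.allclose(L @ D @ L.T, A)   # symmetry (and inertia) preserved
    print(np.round(D, 3))                # zero diagonal forces 2x2 pivot blocks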

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them – "Shape Morphing LU" or SMLU

49

Column-recursive LU – Words = O(n^3/M^(1/2)), Messages = O(n^3/M):

    func factor(A):
        if A has 1 column:
            update it
        else:
            factor(left half of A)
            update right half of A
            factor(right half of A)

Shape Morphing LU – Words = O(n^3/M^(1/2)), Messages = O(n^3/M^(3/2)):

    func factor(A):
        if A has 1 column:
            update it
        else:
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of AP = QR span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (see the sketch below):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
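A serial sketch of the column-tournament idea, using QR with column pivoting as the local "best b columns" selector (a stand-in for the Gu/Eisenstat approach); in the parallel version each round happens one level up a reduction tree:

    import numpy as np
    from scipy.linalg import qr

    def best_b(A, cols, b):
        # Select b "most independent" columns of A[:, cols] by pivoted QR.
        _, _, piv = qr(A[:, cols], mode="economic", pivoting=True)
        return [cols[p] for p in piv[:b]]

    def tournament_pivot(A, b):
        groups = [list(range(i, min(i + b, A.shape[1])))
                  for i in range(0, A.shape[1], b)]
        while len(groups) > 1:        # each round: merge pairs, keep the best b
            nxt = [best_b(A, g1 + g2, b)
                   for g1, g2 in zip(groups[::2], groups[1::2])]
            if len(groups) % 2:
                nxt.append(groups[-1])
            groups = nxt
        return groups[0]              # indices of b candidate pivot columns

    A = np.random.rand(100, 64)
    print(tournament_pivot(A, 4))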

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths, using Floyd-Warshall
• Similar to matmul: let D = A, then run the triply nested loop below
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (second fragment below)

                                                                                  52

    for k = 1:n
        for i = 1:n
            for j = 1:n
                D(i,j) = min(D(i,j), D(i,k) + D(k,j))

    D = DC-APSP(A, n):
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21
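The same recursion, runnable: a NumPy sketch where ssm implements the slide's ⊗ (the min-plus product folded into D). Broadcasting an n x n x n intermediate keeps the code short, not fast:

    import numpy as np

    def ssm(D, A, B):
        # D = A (x) B :  D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12 = D[:m, :m].copy(), D[:m, m:].copy()
        D21, D22 = D[m:, :m].copy(), D[m:, m:].copy()
        D11 = dc_apsp(D11)
        D12 = ssm(D12, D11, D12)
        D21 = ssm(D21, D21, D11)
        D22 = ssm(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = ssm(D21, D22, D21)
        D12 = ssm(D12, D12, D22)
        D11 = ssm(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    INF = np.inf
    G = np.array([[0, 3, INF], [INF, 0, 1], [2, INF, 0]], dtype=float)
    print(dc_apsp(G))   # all-pairs shortest path distances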

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); callouts: 6.2x speedup and 2x speedup.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

      Words_moved = Ω(min(dn/P^(1/2), d^2·n/P))

  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are the eigenvectors; Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-Avoiding Approach:
  – A → Q A Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Animation: bulge chasing on a band of width b+1. Each sweep applies an orthogonal transform Q1, Q2, Q3, … to c columns, creating a bulge of d+c diagonals below the band; applying Q1^T, Q2^T, Q3^T, … from the other side chases the bulge down the band, in annotated steps 1, 2, 3, 4, 5, …]

                                                                                  Q2T

                                                                                  Q3

                                                                                  Q3T

                                                                                  Q4

                                                                                  Q4T

                                                                                  b+1

                                                                                  b+1

                                                                                  d+1

                                                                                  d+1

                                                                                  c

                                                                                  c

                                                                                  d+c

                                                                                  d+c

                                                                                  d+c

                                                                                  d+c

                                                                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                  Successive Band Reduction (BischofLangSun)

                                                                                  1

                                                                                  1

                                                                                  2

                                                                                  2

                                                                                  3

                                                                                  3

                                                                                  4

                                                                                  4

                                                                                  5

                                                                                  5

                                                                                  Q5T

                                                                                  Q1

                                                                                  Q1T

                                                                                  Q2

                                                                                  Q2T

                                                                                  Q3

                                                                                  Q3T

                                                                                  Q5

                                                                                  Q4

                                                                                  Q4T

                                                                                  b+1

                                                                                  b+1

                                                                                  d+1

                                                                                  d+1

                                                                                  c

                                                                                  c

                                                                                  d+c

                                                                                  d+c

                                                                                  d+c

                                                                                  d+c

                                                                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                  Successive Band Reduction (BischofLangSun)

                                                                                  1

                                                                                  1

                                                                                  2

                                                                                  2

                                                                                  3

                                                                                  3

                                                                                  4

                                                                                  4

                                                                                  5

                                                                                  5

                                                                                  6

                                                                                  6

                                                                                  Q5T

                                                                                  Q1

                                                                                  Q1T

                                                                                  Q2

                                                                                  Q2T

                                                                                  Q3

                                                                                  Q3T

                                                                                  Q5

                                                                                  Q4

                                                                                  Q4T

                                                                                  b+1

                                                                                  b+1

                                                                                  d+1

                                                                                  d+1

                                                                                  c

                                                                                  c

                                                                                  d+c

                                                                                  d+c

                                                                                  d+c

                                                                                  d+c

                                                                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                  Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

  Conventional              Communication-Avoiding
  Touch all data 4 times    Touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
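To make the recursion concrete, here is a hedged numerical sketch (ours, not the talk's CA implementation): it uses an explicit Newton iteration for the matrix sign function and SciPy's column-pivoted QR as the rank-revealing step, whereas the CA version uses implicit repeated squaring and randomized RRQR.

    import numpy as np
    from scipy.linalg import qr

    def spectral_dc(A, shift=0.0, tol=1e-12, maxit=60):
        # Illustrative spectral divide-and-conquer step (our sketch):
        # split the spectrum of A at Re(z) = shift; assumes no
        # eigenvalue sits on that line.
        n = A.shape[0]
        X = A - shift * np.eye(n)
        for _ in range(maxit):                     # Newton iteration for sign(X)
            Xn = 0.5 * (X + np.linalg.inv(X))
            if np.linalg.norm(Xn - X, 1) <= tol * np.linalg.norm(Xn, 1):
                X = Xn
                break
            X = Xn
        P = 0.5 * (np.eye(n) + X)                  # projector onto invariant subspace
        k = int(round(np.trace(P)))                # its dimension; if 0 or n, re-try shift
        Q, _, _ = qr(P, pivoting=True)             # RRQR: leading k columns span range(P)
        T = Q.T @ A @ Q                            # T[k:, :k] ~ eps; recurse on T[:k,:k], T[k:,k:]
        return T, k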

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3 — two levels (words & messages): [FLPR'99][BDLST'13][MKL etc.]; memory hierarchy (words & messages): [FLPR'99][BDLST'13][MKL etc.]
• Cholesky — two levels: words [G'97][AP'00][LAPACK][BDHS'09], messages [G'97][AP'00][BDHS'09]; memory hierarchy: [G'97][AP'00][BDHS'09] for both
• Sym. Indefinite — two levels: [BBDDDPSTY'13] for words and messages
• LU — two levels: words [G'97][T'97][GDX'11][BDLST'13], messages [GDX'11][BDLST'13]; memory hierarchy: words [G'97][T'97][BDLST'13], messages [BDLST'13]
• QR — two levels: words [EG'98][FW'03][DGHL'12][BDLST'13], messages [FW'03][DGHL'12][BDLST'13]; memory hierarchy: words [EG'98][FW'03][BDLST'13], messages [FW'03][BDLST'13]
• Rank-Revealing QR — [BDD'11][DGGX'13]
• Sym. Eig & SVD — words [BDD'11][BDK'13], messages [BDD'11]
• Non-Sym. Eig — [BDD'11] for words and messages

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3 — words (BW): [AGZ'94][MT'99][ScaLAPACK]; messages (L): [C'69][vGW'97][SD'11]; saving factor: L: n/P^(1/2)
• Cholesky — [ScaLAPACK][T'99][SD'11]; saving factor: L: n/P^(1/2)
• Sym. Indefinite — words: [BBDDDPSTY'13][ScaLAPACK]; messages: [BBDDDPSTY'13]; saving factor: L: n/P^(1/2)
• LU — words: [ScaLAPACK][GDX'11][T'99][SD'11]; messages: [GDX'11][T'99][SD'11]; saving factor: L: n/P^(1/2)
• QR — words: [ScaLAPACK][DGHL'12][T'99]; messages: [DGHL'12][T'99]; saving factor: L: n/P^(1/2)
• Rank-Revealing QR — [BDD'11][DGGX'13]
• Sym. Eig & SVD — words: [BDD'11][BDK'13][ScaLAPACK]; messages: [BDD'11][BDK'13]; saving factor: L: n/P^(1/2)
• Non-Sym. Eig — [BDD'11] for words and messages; saving factor: BW: P^(1/2), L: n

Attaining with extra memory: 2.5D algorithms, M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the sketch below)
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                  75
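As a concrete picture of where the O(k) factor comes from, here is the conventional kernel in a short, hedged sketch (the function name and the 1D test matrix are ours); the CA matrix-powers kernel computes the same basis but reads A, and exchanges halo data, only once:

    import numpy as np
    import scipy.sparse as sp

    def krylov_basis(A, x, k):
        # Conventional kernel: k separate SpMVs, so A is read (and halos
        # exchanged) k times. The CA matrix-powers kernel returns the same
        # basis reading A once, at the cost of redundant boundary flops.
        V = np.empty((k + 1, x.shape[0]))
        V[0] = x
        for j in range(k):
            V[j + 1] = A @ V[j]          # SpMV: the communication bottleneck
        return V

    # toy example: 1D Laplacian, basis [x, Ax, ..., A^4 x]
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(1000, 1000), format="csr")
    V = krylov_basis(A, np.ones(1000), 4)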

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)

                                                                                  77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                  78

Speedups on Itanium 2: The Need for Search

[Plot, Mflops over all register block sizes: reference implementation vs best blocking (4x2).]

                                                                                  79

Register Profile: Itanium 2

[Heat map over register block sizes: from 190 Mflops (worst) to 1190 Mflops (best).]

                                                                                  80

Register Profiles: IBM and Intel IA-64

[Four heat-map panels, best fraction of machine peak in parentheses:
 Power3 (17%): 122–252 Mflops; Power4 (16%): 459–820 Mflops;
 Itanium 1 (8%): 107–247 Mflops; Itanium 2 (33%): 190 Mflops–1.2 Gflops.]

                                                                                  Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1M

                                                                                  82

                                                                                  Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1M

                                                                                  83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But it would lead to lots of "fill-in"

                                                                                  84

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher (divided by the 1.5x extra flops)

                                                                                  85
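SciPy's BSR format gives a quick, hedged stand-in for this register-blocking experiment (the matrix below is a random placeholder with roughly raefsky's size and density, not the raefsky matrix itself); tobsr performs exactly the fill-in of explicit zeros described above:

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(21200, 21200, density=3.3e-3, format="csr")  # ~1.5M nonzeros
    A_bsr = A.tobsr(blocksize=(8, 8))    # pack into 8x8 dense blocks, filling
                                         # explicit zeros where a block is partial
    fill_ratio = A_bsr.nnz / A.nnz       # stored values (incl. zeros) / true nnz
    x = np.ones(A.shape[1])
    y = A_bsr @ x                        # blocked SpMV: one index per 64 values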

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                                  86

                                                                                  100x100 Submatrix Along Diagonal

87

                                                                                  Post-RCM Reordering

                                                                                  88

                                                                                  Effect of Combined RCM+TSP Reordering

Before: green + red. After: green + blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                  90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                  91
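In the same spirit as OSKI's run-time tuning (but with names of our own, not the OSKI API), a minimal search over register block sizes just times each candidate and keeps the winner — which is why the profiles above matter: the best block size is not predictable without measuring.

    import itertools
    import time

    def pick_blocksize(A_csr, x, candidates=(1, 2, 4, 8), reps=10):
        # Pick the fastest register block size by direct timing (sketch).
        best_time, best_rc = float("inf"), (1, 1)
        for r, c in itertools.product(candidates, repeat=2):
            if A_csr.shape[0] % r or A_csr.shape[1] % c:
                continue                      # tobsr needs divisible dimensions
            A = A_csr.tobsr(blocksize=(r, c))
            t0 = time.perf_counter()
            for _ in range(reps):
                A @ x
            t = time.perf_counter() - t0
            if t < best_time:
                best_time, best_rc = t, (r, c)
        return best_rc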

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                  93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing: the SpMV and the dot products require communication in each iteration.]

94

Example: CA-Conjugate Gradient

[Algorithm listing: the SpMVs are replaced by one call to the CA matrix-powers kernel and the dot products by one global reduction to compute the Gram matrix G; the local computations within the inner loop require no communication.]
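For reference, a plain-Python rendering of the classical iteration with its communication marked in comments (our sketch, not the slide's listing); CA-CG restructures s of these iterations into one matrix-powers call plus one reduction for the Gram matrix G:

    import numpy as np

    def cg(A, b, x, iters):
        # Classical CG; comments mark the per-iteration communication
        # that CA-CG amortizes over s steps.
        r = b - A @ x                    # SpMV
        p = r.copy()
        rr = r @ r                       # dot product: global reduction
        for _ in range(iters):
            Ap = A @ p                   # SpMV: neighbor communication
            alpha = rr / (p @ Ap)        # dot product: global reduction
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r               # dot product: global reduction
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x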

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                  96

[Plot: residual convergence of CG vs CA-CG (monomial basis) on a model problem — 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400 — with machine precision marked. CA-CG converges slower and loses accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

                                                                                  97
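The breakdown is easy to reproduce (a hedged sketch; the slide's exact starting vector and scaling may differ): build the 30x30 Poisson matrix and measure the conditioning of the monomial Krylov basis.

    import numpy as np
    import scipy.sparse as sp

    # Model problem from the plot: 2D Poisson, 5-point stencil, 30x30 grid.
    n = 30
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()

    rng = np.random.default_rng(0)
    v = rng.standard_normal(n * n)
    V = [v / np.linalg.norm(v)]
    for _ in range(16):                   # monomial basis [x, Ax, ..., A^16 x],
        w = A @ V[-1]                     # columns normalized
        V.append(w / np.linalg.norm(w))
    # Condition number grows rapidly with s; compare with 1/eps ~ 4.5e15,
    # near or beyond which the basis is numerically rank deficient.
    print(np.linalg.cond(np.stack(V, axis=1)))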

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ (D small & square), as in the sketch below
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                                       Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz)):   CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit (o(nnz)):   Graph Laplacian             Stencils
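The point of the S + UDVᵀ row is that "sparse" should mean "cheap to apply and to store", as in this hedged sketch (all sizes and names are illustrative):

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(1)
    n, k = 10_000, 5
    S = sp.random(n, n, density=1e-3, format="csr")
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((n, k))
    D = np.diag(rng.standard_normal(k))
    x = rng.standard_normal(n)

    # y = A x for A = S + U D V^T, never forming A explicitly:
    # one SpMV plus skinny products -- O(nnz(S) + nk) work, o(n^2) storage.
    y = S @ x + U @ (D @ (V.T @ x))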

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                  101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

                                                                                  Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

                                                                                  103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get the exact answer (fails 2)
  – Prerounding technique (Nguyen, D.) — see the sketch below

                                                                                  104
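A single-bin sketch of the prerounding idea (ours — the real algorithm uses several bins and a one-pass reduction to keep goal 2): every summand is snapped to a common grid so the subsequent additions are exact and therefore order-independent.

    import math

    def reproducible_sum(x):
        # One-bin prerounding: snap every summand to a multiple of ulp(S),
        # so all subsequent additions are exact and the result is identical
        # for ANY summation order or thread count. Accuracy is O(n*ulp(S));
        # the full algorithm sharpens it with more bins.
        n = len(x)
        m = max((abs(v) for v in x), default=0.0)
        if m == 0.0:
            return 0.0
        S = 1.5 * 2.0 ** math.ceil(math.log2(n * m) + 1)   # S >= 3*n*m
        snapped = [(xi + S) - S for xi in x]   # exact truncation trick (IEEE 754)
        total = 0.0
        for v in snapped:                      # every partial sum is an exact
            total += v                         # multiple of ulp(S): no rounding
        return total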

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                  Summary

Don't Communic…

                                                                                  106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


2.5D CALU with Tournament Pivoting (c = 4 copies)

                                                                                    44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
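These parameters plug into the usual three-term cost model; the sketch below is ours for illustration (gamma, the time per flop, and the asymptotic message counts are our assumptions, not from the slide) and shows why cutting latency cost dominates the predicted speedups in the plot that follows.

    import math

    alpha = 1e-6            # interconnect latency (s), from the list above
    beta = 8.0 / 100e9      # s per 8-byte word at 100 GB/s, from the list above
    gamma = 1e-11           # assumed time per flop (our assumption)

    def lu_time(n, p, messages):
        flops = (2.0 / 3.0) * n**3 / p        # per processor
        words = n**2 / math.sqrt(p)           # 2D lower bound, up to constants
        return flops * gamma + words * beta + messages * alpha

    n, p = 10**6, 10**6
    t_pp = lu_time(n, p, messages=n * math.log2(p))              # partial pivoting
    t_ca = lu_time(n, p, messages=math.sqrt(p) * math.log2(p))   # tournament pivoting
    print(f"predicted speedup ~ {t_pp / t_ca:.1f}x")             # latency term dominates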

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: horizontal axis log2(p), vertical axis log2(n²/p) = log2(memory_per_proc); predicted speedups up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: PAPᵀ = LDLᵀ, D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAPᵀ = LTLᵀ where T is banded, using TSLU
        [Figure: band matrix, zeros outside the band]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
– So far, could not do partial pivoting and minimize #messages, just #words
– Challenge:
• Column layout good for choosing pivots, bad for matmul
• Blocked layout good for matmul, bad for choosing pivots
– Solution: use both layouts, switching between them
• "Shape Morphing LU" or SMLU

• func factor(A): if A has 1 column, update it; else
    factor(left half of A)
    update right half of A
    factor(right half of A)

• Words = O(n^3/M^(1/2))
• Messages = O(n^3/M)

• func factor(A): if A has 1 column, update it; else
    factor(left half of A)
    reshape to recursive block format
    update right half of A
    reshape to columnwise format
    factor(right half of A)

• Words = O(n^3/M^(1/2))
• Messages = O(n^3/M^(3/2))
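To make the recursion above concrete, here is a minimal dense sketch of the factorization skeleton in Python/NumPy. It is not the SMLU implementation: there is no pivoting and no layout reshaping, and the function name rlu and the use of NumPy views are illustrative assumptions.

    import numpy as np

    def rlu(A):
        """Recursive right-looking LU without pivoting: overwrites the
        m x n panel A (m >= n) with its L (unit lower) and U factors."""
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]          # "update it": scale the L column
            return
        k = n // 2
        rlu(A[:, :k])                    # factor(left half of A)
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])   # update right half: U12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]            # Schur complement
        rlu(A[k:, k:])                   # factor(right half of A)

SMLU wraps the "update right half" step in the two reshape steps listed above, so the update runs on a blocked layout while pivot selection (absent in this sketch) sees a columnwise layout.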

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
– Choose permutation P so that leading columns of AP = QR span column space of A: Rank Revealing QR (RRQR)
– Usual approach, like Partial Pivoting:
• Put longest column first, update rest of matrix, repeat
• Hard to do using BLAS3 at all, let alone hit lower bound
– Use Tournament Pivoting (a toy sketch follows below):
• Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
• Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
– Idea extends to other pivoting schemes:
• Cholesky with diagonal pivoting
• LU with complete pivoting
• LDL^T with complete pivoting
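A toy version of tournament pivoting for column selection, sketched in Python with SciPy's QR-with-column-pivoting as the local "pick the best b columns" step. The function name and the pair-at-a-time reduction structure are illustrative assumptions, not the production algorithm.

    import numpy as np
    from scipy.linalg import qr

    def tournament_columns(A, b):
        """Select b candidate pivot columns of A by a binary tournament:
        each round runs QR with column pivoting on ~2b columns and keeps
        the b winners, giving a reduction tree instead of a
        column-at-a-time pivot search over the whole matrix."""
        groups = [np.arange(i, min(i + b, A.shape[1]))
                  for i in range(0, A.shape[1], b)]
        while len(groups) > 1:
            winners = []
            for j in range(0, len(groups), 2):
                g = (groups[j] if j + 1 == len(groups)
                     else np.concatenate([groups[j], groups[j + 1]]))
                _, _, piv = qr(A[:, g], mode='economic', pivoting=True)
                winners.append(g[piv[:b]])   # keep the b "best" columns
            groups = winners
        return groups[0]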

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
– Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    func D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
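A runnable NumPy transcription of the two fragments above, for small inputs. Illustrative assumptions: n is a power of two, A is a dense float matrix with np.inf where there is no edge and 0 on the diagonal, and minplus implements the ⊗ product.

    import numpy as np

    def minplus(D, A, B):
        """D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) ): the (min,+) product."""
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        """Divide-and-conquer all-pairs shortest paths (Kleene's algorithm)."""
        n = A.shape[0]
        if n == 1:
            return A
        D = A.copy()
        h = n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D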

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations show a 6.2x speedup and a 2x speedup.]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
– Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices (a small sketch follows below)
– w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
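For intuition, a small Python sketch of George-style nested dissection ordering on a 2D grid: order the two halves first, then the separator last. The coordinate-list representation and the function name are illustrative choices.

    def nd_order(xs, ys):
        """Nested dissection elimination order for the grid xs x ys:
        eliminate both halves, then the separating line of cells."""
        if len(xs) * len(ys) <= 1:
            return [(x, y) for x in xs for y in ys]
        if len(xs) >= len(ys):               # split along the longer side
            mid = xs[len(xs) // 2]
            half1 = nd_order([x for x in xs if x < mid], ys)
            half2 = nd_order([x for x in xs if x > mid], ys)
            sep = [(mid, y) for y in ys]
        else:
            mid = ys[len(ys) // 2]
            half1 = nd_order(xs, [y for y in ys if y < mid])
            half2 = nd_order(xs, [y for y in ys if y > mid])
            sep = [(x, mid) for x in xs]
        return half1 + half2 + sep

    # e.g. nd_order(list(range(7)), list(range(7))) orders a 7x7 grid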

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; need a new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
– Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
– A → Q^T A Q = T, where Q orthogonal, T tridiagonal
– T → U^T T U = Λ, where U orthogonal, Λ diagonal
– Columns of Q·U are eigenvectors, Λ holds eigenvalues
– Dense → Tridiagonal → Diagonal
– Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
– A → Q A Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
– Continue as above, starting with B
– Dense → Banded → Tridiagonal → Diagonal
– Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
– Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence (animation frames): orthogonal sweeps Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T eliminate c columns of the band at a time, creating bulges (labeled 1-6) that are chased down the band. Parameters: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

[Animations of both schemes omitted.]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
– n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
– n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
– n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
– n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
– Best sequential speedup vs MKL: 1.9x
– Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
– Find orthogonal matrix Q whose leading columns span an invariant subspace of A
– Q^T A Q = [[A11, A12], [ε, A22]] will be block upper triangular
– Apply recursively to A11, A22
– Depends on randomization:
1. Randomized Rank Revealing QR decomposition
2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and starting vector (a toy sketch follows below)
– Many such "Krylov Subspace Methods"
• Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
– Assume matrix "well-partitioned"
– Serial implementation:
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data; optimal
– Parallel implementation on p processors:
• Conventional: O(k log p) messages (k SpMV calls, dot prods)
• New: O(log p) messages; optimal
• Lots of speedup possible (modeled and measured)
– Price: some redundant computation
– Challenges: poor partitioning, preconditioning, numerical stability
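The savings come from a "matrix powers kernel": fetch k layers of ghost data once, then compute all k SpMVs locally. A 1D toy version in Python/NumPy; the 1D Poisson stencil and the window bookkeeping are illustrative assumptions, not the general kernel.

    import numpy as np

    def apply_poisson_1d(y):
        """(A y)[i] = 2 y[i] - y[i-1] - y[i+1], with Dirichlet ends."""
        z = 2.0 * y
        z[1:] -= y[:-1]
        z[:-1] -= y[1:]
        return z

    def matrix_powers_1d(x, lo, hi, k):
        """Return [(A x)[lo:hi], ..., (A^k x)[lo:hi]] while reading only
        x[lo-k : hi+k]: one ghost exchange of depth k replaces the k
        exchanges a conventional implementation would perform."""
        n = len(x)
        a, b = max(lo - k, 0), min(hi + k, n)  # working window incl. ghosts
        w = x[a:b].copy()
        out = []
        for _ in range(k):
            w = apply_poisson_1d(w)
            # values at an artificial cut (not a true domain boundary) are
            # now wrong, since the neighbor outside the window was missing:
            # shrink the trusted window by one cell on each cut side
            if a > 0:
                a += 1
                w = w[1:]
            if b < n:
                b -= 1
                w = w[:-1]
            out.append(w[lo - a : hi - a].copy())
        return out

The price is the redundant work neighboring processors do on overlapping ghost regions, matching the "some redundant computation" bullet above.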

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Figure: performance over r x c register block sizes; "Reference" marks the unblocked code, "Best: 4x2" the fastest blocking; color scale in Mflops.]

Register Profile: Itanium 2

[Figure: full register profile, ranging from 190 Mflops to 1190 Mflops.]

Register Profiles: IBM and Intel IA-64

[Figure: four register profiles, best fraction of machine peak in parentheses. Power3 (17%): 122 to 252 Mflops; Power4 (16%): 459 to 820 Mflops; Itanium 1 (8%): 107 to 247 Mflops; Itanium 2 (33%): 190 Mflops to 1.2 Gflops.]
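The platform-to-platform variation in these profiles is why the best block size is found by search rather than by a model alone. A minimal run-time search in Python/SciPy over candidate block sizes; the function names and the candidate set are illustrative assumptions.

    import time
    import scipy.sparse as sp

    def time_spmv(B, x, reps=10):
        t0 = time.perf_counter()
        for _ in range(reps):
            B @ x
        return (time.perf_counter() - t0) / reps

    def autotune_blocksize(A_csr, x, sizes=(1, 2, 3, 4, 6, 8)):
        """Empirically pick the fastest r x c register blocking for SpMV:
        convert to BSR for each legal candidate, time it, keep the best."""
        best_rc, best_t = (1, 1), time_spmv(A_csr, x)
        for r in sizes:
            for c in sizes:
                if A_csr.shape[0] % r or A_csr.shape[1] % c:
                    continue              # BSR needs divisible dimensions
                B = sp.bsr_matrix(A_csr, blocksize=(r, c))
                t = time_spmv(B, x)
                if t < best_t:
                    best_rc, best_t = (r, c), t
        return best_rc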

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
– Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
– Logical grid of 3x3 cells
– Fill in explicit zeros
– Unroll 3x3 block multiplies
– "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
– Actual mflop rate 1.5^2 = 2.25x higher
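The fill-ratio arithmetic can be checked directly. A short Python/SciPy helper (the name fill_ratio is hypothetical; assumes the matrix dimensions are divisible by the block size):

    import scipy.sparse as sp

    def fill_ratio(A_csr, r, c):
        """Stored entries after r x c blocking (explicit zeros included)
        divided by true nonzeros; blocking pays off roughly when the
        per-entry speedup of the blocked kernel exceeds this ratio."""
        B = sp.bsr_matrix(A_csr, blocksize=(r, c))
        stored = B.data.shape[0] * r * c   # every block stores r*c values
        return stored / A_csr.nnz

With a fill ratio of 1.5, a 1.5x end-to-end speedup means the blocked kernel ran 1.5 x 1.5 = 2.25x faster per stored entry, as the slide notes.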

[Figure: Source: Accelerator Cavity Design Problem (Ko via Husbands)]

[Figure: 100x100 Submatrix Along Diagonal]

[Figure: Post-RCM Reordering]

Effect of Combined RCM+TSP Reordering

[Figure: Before: Green + Red; After: Green + Blue]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
– Register blocking (RB): up to 4x over CSR
– Reordering to create dense structure: 2x over CSR
– Variable block splitting: 2.1x over CSR, 1.8x over RB
– Diagonals: 2x over CSR
– Symmetry: 2.8x over CSR, 2.6x over RB
– Cache blocking: 2.8x over CSR
– Multiple vectors (SpMM): 7x over CSR
– And combinations…
• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
– A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
– More general kernels later…

Optimized Sparse Kernel Interface (OSKI)

• Provides sparse kernels automatically tuned for user's matrix & machine
– BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
– Does both off-line and run-time tuning
– Hides complexity of run-time tuning
• For "advanced" users & solver library writers
– Available as stand-alone library
– Available as PETSc extension
– bebop.cs.berkeley.edu/oski
• pOSKI
– Extension to multicore architectures
– OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
– bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm figure; annotation: SpMVs and dot products require communication in each iteration.]

Example: CA-Conjugate Gradient

[Algorithm figure; annotations: the SpMVs are replaced by a call to the CA Matrix Powers Kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication.]
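Since the algorithms above appear only as images, here is a textbook CG sketch in Python/NumPy with the communication points marked; CA-CG restructures this loop so that k SpMVs become one matrix powers call and the dot products become one block reduction (the Gram matrix G) per outer iteration.

    import numpy as np

    def cg(A, b, x0, iters):
        """Textbook CG sketch; comments mark where communication occurs
        in a parallel setting (one SpMV + two dot products per iteration)."""
        x = x0.copy()
        r = b - A @ x           # SpMV: neighbor communication
        p = r.copy()
        rs = r @ r              # dot product: global reduction
        for _ in range(iters):
            Ap = A @ p          # SpMV: neighbor communication
            alpha = rs / (p @ Ap)    # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r      # dot product: global reduction
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x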

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. A horizontal line marks machine precision.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
– Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
– Need to write A^k = (S + U·D·V^T)^k as sum of S^k and low-rank matrices

                                   Indices
                                   Explicit (O(nnz))    Implicit (o(nnz))
Nonzero entries, Explicit (O(nnz)): CSR and variations | Vision, climate, AMR, …
Nonzero entries, Implicit (o(nnz)): Graph Laplacian    | Stencils
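The "sparse plus low rank" case above only ever needs the factors, never the dense sum. A one-function Python sketch (names are illustrative):

    import scipy.sparse as sp

    def apply_s_plus_lowrank(S, U, D, V, x):
        """y = (S + U D V^T) x without forming the dense matrix:
        one SpMV plus three skinny dense products."""
        return S @ x + U @ (D @ (V.T @ x))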

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods: Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
– From Kai Diethelm, at GNS-MBH
– Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
– Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
– Most: "What? How will I debug without reproducibility?"
– Few: "I know better, and do careful error analysis"
– S. Govindjee: needs it for fracture simulations
– S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: two panels – "Absolute Error for Random Vectors (same magnitude, opposite signs)" and "Relative Error for Orthogonal Vectors". Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible.]
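This non-reproducibility is just floating-point non-associativity surfacing through different reduction trees. A minimal self-contained illustration (plain Python, nothing MKL-specific; the values are made up):

    # Same three summands, two reduction orders, two different answers:
    x = [1.0, 1e-16, -1.0]
    s1 = (x[0] + x[1]) + x[2]   # 1e-16 is absorbed into 1.0, so s1 = 0.0
    s2 = x[0] + (x[1] + x[2])   # s2 is approximately 1.1e-16
    print(s1, s2, s1 == s2)     # 0.0  1.1102230246251565e-16  False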

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
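A minimal one-"bin" Python sketch of the pre-rounding idea; the single-bin simplification and the function name are mine, and the real algorithm (Nguyen/Demmel) uses a few bins of decreasing granularity to recover accuracy:

    import math

    def reproducible_sum(x):
        # Round every summand to a common power-of-two granularity g,
        # chosen so the rounded values sum exactly in integer arithmetic;
        # exact sums commute, so the result is independent of order,
        # layout, and #threads. Accuracy cost: about log2(n) bits here.
        n = len(x)
        M = max(abs(v) for v in x)
        if M == 0.0:
            return 0.0
        e = math.frexp(M)[1]                            # exponent of max |x_i|
        g = 2.0 ** (e - 52 + math.ceil(math.log2(n)))   # common granularity
        total = sum(round(v / g) for v in x)            # exact integer sum
        return total * g                                # |error| <= n*g/2

In parallel this needs only two reductions (a max, then an exact sum), which is why it can scale well while meeting goals 1, 3, and 4.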

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: heat map of predicted speedup over log2(p) (horizontal) and log2(n²/p) = log2(memory_per_proc) (vertical); speedups up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Saves 1/2 the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Figure: the banded matrix T]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper award at IPDPS'13
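For reference, the "usual approach" named above (Bunch-Kaufman) is what LAPACK's sytrf computes; a minimal SciPy demonstration on a toy indefinite matrix:

    import numpy as np
    from scipy.linalg import ldl

    # Bunch-Kaufman LDL^T via LAPACK sytrf: D is block diagonal with
    # 1x1 and 2x2 blocks (the 2x2 blocks handle indefiniteness), and
    # the inertia of A can be read off from D.
    A = np.array([[0., 1., 2.],
                  [1., 0., 3.],
                  [2., 3., 0.]])
    L, D, perm = ldl(A)                 # L absorbs the permutation
    assert np.allclose(L @ D @ L.T, A)  # symmetric factorization holds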

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

Recursive GEPP (columnwise layout):
  func factor(A):
    if A has 1 column: update it
    else:
      factor(left half of A)
      update right half of A
      factor(right half of A)
  #Words = O(n^3 / M^(1/2)); #Messages = O(n^3 / M)

Shape Morphing LU (SMLU):
  func factor(A):
    if A has 1 column: update it
    else:
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
  #Words = O(n^3 / M^(1/2)); #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting (see the sketch below):
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
  • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
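A minimal serial Python sketch of the column tournament; the function name is mine, and the local selection rule used here is ordinary column-pivoted QR (the slide notes Gu/Eisenstat-style RRQR would be a stronger choice):

    import numpy as np
    from scipy.linalg import qr

    def tournament_select(A, b):
        # Repeatedly merge pairs of b-column groups, keeping the b
        # "best" columns of each pair; the winners of the final round
        # are candidates for the first b pivot columns.
        groups = [list(range(j, min(j + b, A.shape[1])))
                  for j in range(0, A.shape[1], b)]
        while len(groups) > 1:
            merged = []
            for i in range(0, len(groups), 2):
                cols = groups[i] + (groups[i + 1] if i + 1 < len(groups) else [])
                _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
                merged.append([cols[p] for p in piv[:b]])
            groups = merged
        return groups[0]

    # Toy use: pick 4 pivot columns of a random 100 x 32 matrix.
    cols = tournament_select(np.random.randn(100, 32), 4)

In parallel, each merge is independent, so the tournament is a reduction tree: O(log P) messages instead of one message per pivot.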

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n):
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 * D12
    D21 = D21 * D11
    D22 = D21 * D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 * D21
    D12 = D12 * D22
    D11 = D12 * D21
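A serial NumPy sketch of the same recursion (the 2.5D algorithm distributes exactly these block operations); D is a dense distance matrix with 0 on the diagonal and inf for missing edges, and "*" is the accumulating min-plus product defined above:

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the "matmul" of
        # the (min,+) semiring, vectorized over k with broadcasting.
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        # Recursive (Kleene) APSP; same dependency structure as
        # recursive LU, so 2.5D-style replication applies.
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12 = D[:h, :h], D[:h, h:]
        D21, D22 = D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    # Toy use: random weighted digraph, inf = no edge, 0 on the diagonal.
    n = 8
    D = np.where(np.random.rand(n, n) < 0.4, np.random.rand(n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    D = dc_apsp(D)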

Performance of 2.5D APSP using Kleene
[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores), with callouts of 6.2x and 2x speedup.]

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w×w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n×n grid; w = n² for 3D n×n×n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; need a new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω(min(d·n/P^(1/2), d²·n/P))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost (a toy instance of this regime is shown below)
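A toy SciPy illustration of the regime this bound addresses (random Erdos-Renyi sparse factors with about d nonzeros per row; the parameter values are made up):

    import scipy.sparse as sp

    n, d = 1 << 12, 8
    A = sp.random(n, n, density=d / n, format='csr')   # ~d nnz per row
    B = sp.random(n, n, density=d / n, format='csr')
    C = A @ B                    # ~d^2 nnz per row of C in expectation,
    print(A.nnz, B.nnz, C.nnz)   # so entries of C see little or no reuse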

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time (see the sketch below)
  – Banded → Tridiagonal: need new(ish) idea
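A serial dense NumPy sketch of the Dense → Banded step; the function name is mine, and the CA algorithm would replace the panel QR with TSQR and block the two-sided updates:

    import numpy as np

    def sym_to_band(A, b):
        # Eliminate b columns at a time with one QR of the panel below
        # the band, applying each Q two-sidedly to keep symmetry.
        A = A.copy()
        n = A.shape[0]
        for j in range(0, n - b, b):
            panel = A[j + b:, j:j + b]                   # block below the band
            Q, R = np.linalg.qr(panel, mode='complete')  # CA version: TSQR
            new_panel = np.zeros_like(panel)
            new_panel[:b, :] = R[:b, :]                  # triangular survivor
            A[j + b:, j:j + b] = new_panel
            A[j:j + b, j + b:] = new_panel.T             # mirror for symmetry
            A[j + b:, j + b:] = Q.T @ A[j + b:, j + b:] @ Q   # two-sided update
        return A

    # Toy check: result should be symmetric with bandwidth b.
    n, b = 64, 8
    A = np.random.randn(n, n); A = A + A.T
    B = sym_to_band(A, b)
    assert np.allclose(B, B.T)
    assert np.allclose(np.triu(B, b + 1), 0.0)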

Successive Band Reduction (Bischof/Lang/Sun)
[Figure sequence: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b. Each orthogonal sweep Q1, …, Q5 is applied two-sidedly (Qi^T · Qi): it eliminates c columns of d diagonals, creating a (d+c)×(d+c) bulge further down the band, which subsequent sweeps chase off the end of the matrix (steps 1–6 shown).]

Conventional vs CA-SBR

  Conventional              Communication-Avoiding
  Touch all data 4 times    Touch all data once

[Animations omitted.]

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
  (A toy, non-CA sketch of one divide step follows.)
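A toy Python sketch of one divide step via the matrix sign function; the explicit inverse used here is exactly the communication-heavy step the CA algorithms replace with inverse-free, QR-based iterations and a randomized RRQR. Function name and iteration count are mine:

    import numpy as np
    from scipy.linalg import qr

    def spectral_split(A, shift=0.0, iters=40):
        n = A.shape[0]
        X = A - shift * np.eye(n)
        for _ in range(iters):               # Newton: X <- (X + X^-1)/2
            X = 0.5 * (X + np.linalg.inv(X)) # converges to sign(A - shift*I)
        P = 0.5 * (X + np.eye(n))            # projector onto the invariant
        k = int(round(np.trace(P)))          #   subspace; k = its dimension
        Q, _, _ = qr(P, pivoting=True)       # leading k columns of Q span it
        T = Q.T @ A @ Q                      # block upper triangular:
        return Q, T, k                       #   T[k:, :k] is ~ epsilon

    A = np.random.randn(6, 6)
    Q, T, k = spectral_split(A)
    print(k, np.linalg.norm(T[k:, :k]))      # off-diagonal block is tiny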

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns in the original table: #Words and #Messages, for two levels of memory and for a full memory hierarchy; citations per row:

  BLAS-3:             [FLPR'99], [BDLST'13], [MKL etc.] (all four columns)
  Cholesky:           [G'97], [AP'00], [LAPACK], [BDHS'09]
  Sym. Indefinite:    [BBDDDPSTY'13]
  LU:                 [G'97], [T'97], [GDX'11], [BDLST'13]
  QR:                 [EG'98], [FW'03], [DGHL'12], [BDLST'13]
  Rank-Revealing QR:  [BDD'11], [DGGX'13]
  Sym. Eig & SVD:     [BDD'11], [BDK'13]
  Nonsym. Eig:        [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: #Words (BW), #Messages (L), and the saving factor attainable with extra memory (2.5D, M = Θ(c·n²/P)):

  BLAS-3:             BW [AGZ'94], [MT'99], [ScaLAPACK]; L [C'69], [vGW'97], [SD'11]; saving: L by n/P^(1/2)
  Cholesky:           [ScaLAPACK], [T'99], [SD'11]; saving: L by n/P^(1/2)
  Sym. Indefinite:    BW [BBDDDPSTY'13], [ScaLAPACK]; L [BBDDDPSTY'13]; saving: L by n/P^(1/2)
  LU:                 BW [ScaLAPACK], [GDX'11], [T'99], [SD'11]; L [GDX'11], [T'99], [SD'11]; saving: L by n/P^(1/2)
  QR:                 BW [ScaLAPACK], [DGHL'12], [T'99]; L [DGHL'12], [T'99]; saving: L by n/P^(1/2)
  Rank-Revealing QR:  [BDD'11], [DGGX'13]
  Sym. Eig & SVD:     BW [BDD'11], [BDK'13], [ScaLAPACK]; L [BDD'11], [BDK'13]; saving: L by n/P^(1/2)
  Nonsym. Eig:        [BDD'11]; saving: BW by P^(1/2), L by n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                      Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse A·x=b or A·x=λ·x
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data; optimal
  – Parallel implementation on p processors
    • Conventional: O(k·log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages; optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: Poor partitioning, Preconditioning, Numerical Stability

                                                                                      75

                                                                                      Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                      ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                      ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                      bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                      bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                      77

Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (a CSR sketch follows)

                                                                                      78
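Where tuning starts: even the baseline CSR kernel is dominated by the indirect loads of x, which is exactly what register blocking attacks. A minimal plain-Python sketch, illustrative only:

    import numpy as np

    def spmv_csr(val, col, rowptr, x):
        # y = A x in CSR form: one indirect, irregular load of x per nonzero
        y = np.zeros(len(rowptr) - 1)
        for i in range(len(y)):
            for k in range(rowptr[i], rowptr[i + 1]):
                y[i] += val[k] * x[col[k]]   # x[col[k]] is the hard-to-tune access
        return y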

Speedups on Itanium 2: The Need for Search
(Figure: heat map of SpMV performance in Mflops over register block sizes; Reference point vs Best = 4x2)

                                                                                      79

Register Profile: Itanium 2
(Figure: performance ranges from 190 Mflops to 1190 Mflops)

                                                                                      80

Register Profiles: IBM and Intel IA-64
(Figure, four register-profile panels: Power3 - 17% of peak, Power4 - 16%, Itanium 2 - 33%, Itanium 1 - 8%; best vs reference rates: 252 vs 122 Mflops, 820 vs 459 Mflops, 1.2 Gflops vs 190 Mflops, 247 vs 107 Mflops)

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                      82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                      83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                      84

Extra Work Can Improve Efficiency!
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (see the sketch below)
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher, since 1.5x more flops are done in 1/1.5 the time

                                                                                      85
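The fill-in trade-off is easy to reproduce with scipy's BSR format. A minimal sketch on a synthetic matrix (not the slide's FEM matrix), showing the fill ratio and that the blocked operator computes the same SpMV:

    import numpy as np
    import scipy.sparse as sp

    # Illustrative sketch: measure the "fill ratio" of 3x3 register blocking
    A = sp.random(300, 300, density=0.02, format="csr", random_state=0)

    A_bsr = A.tobsr(blocksize=(3, 3))  # store aligned 3x3 blocks, padding with explicit zeros
    fill_ratio = A_bsr.nnz / A.nnz     # stored values (incl. explicit zeros) / true nonzeros
    print("fill ratio =", round(fill_ratio, 2))

    # Same mathematical operator, different data structure / inner loop
    x = np.random.default_rng(1).standard_normal(A.shape[1])
    assert np.allclose(A @ x, A_bsr @ x)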

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                      86

                                                                                      100x100 Submatrix Along Diagonal

87

                                                                                      Post-RCM Reordering

                                                                                      88

                                                                                      Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

                                                                                      Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                                                                      90

                                                                                      Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                      91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                      93

Example: Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in each iteration (a sketch follows)
94
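For reference, a plain numpy rendering of classical CG; the comments mark where communication would occur in a parallel run. A sketch, not a tuned solver:

    import numpy as np

    def cg(A, b, x, tol=1e-8, maxiter=1000):
        r = b - A @ x              # SpMV: neighbor communication in parallel
        p = r.copy()
        rr = r @ r                 # dot product: one global reduction
        for _ in range(maxiter):
            Ap = A @ p             # SpMV, every iteration
            alpha = rr / (p @ Ap)  # dot product: another global reduction
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r         # dot product: and another
            if rr_new ** 0.5 < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x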

Example: CA-Conjugate Gradient
The s SpMVs per outer iteration are computed via the CA Matrix Powers Kernel, and one global reduction computes the Gram matrix G; local computations within the inner loop require no communication (see the sketch below)
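A sketch of the two ingredients CA-CG combines. The naive loop below still makes s passes over A; the point of the CA matrix powers kernel is to produce the same basis with O(1) reads of A, and the single Gram matrix replaces the separate dot-product reductions of s CG steps (function names are illustrative):

    import numpy as np

    def monomial_basis(A, v, s):
        # [v, Av, ..., A^s v]; the CA matrix powers kernel computes this with
        # O(1) reads of A (plus ghost zones) instead of s SpMV round trips
        V = np.empty((v.shape[0], s + 1))
        V[:, 0] = v
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        return V

    def gram(V):
        # G = V^T V: one global reduction, after which every inner product
        # needed by the next s CG steps can be read off G locally
        return V.T @ V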

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                      96

(Figure: convergence of CG vs CA-CG (monomial); residual plateaus at machine precision)
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, Cond(A) ~ 400
• Slower convergence due to roundoff; loss of accuracy due to roundoff
• At s = 16, monomial basis is rank deficient! Method breaks down (reproduced in the sketch below)

                                                                                      97
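The breakdown is reproducible on the slide's model problem. In this illustrative sketch the condition number of the monomial basis grows geometrically and reaches about 1/ε (numerical rank deficiency) near s = 16:

    import numpy as np
    import scipy.sparse as sp

    n = 30                                   # 30x30 grid, as in the model problem
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()  # 2D Poisson, 5-point stencil

    rng = np.random.default_rng(0)
    v = rng.standard_normal(n * n)
    V = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))      # normalizing does not cure the conditioning
        print(s, np.linalg.cond(np.column_stack(V)))   # approaches ~1e16 near s = 16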

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square (see the sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·Vᵀ)^k as sum of S^k and low-rank matrices

                                   | Indices Explicit (O(nnz)) | Indices Implicit (o(nnz))
Nonzero entries Explicit (O(nnz))  | CSR and variations        | Vision, climate, AMR, …
Nonzero entries Implicit (o(nnz))  | Graph Laplacian           | Stencils
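The sparse-plus-low-rank case is natural to handle matrix-free. A minimal sketch with scipy's LinearOperator (S, U, D, V named as in the bullet above; all data synthetic):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator

    n, k = 1000, 5
    rng = np.random.default_rng(0)
    S = sp.random(n, n, density=1e-3, format="csr", random_state=0)  # sparse part
    U = rng.standard_normal((n, k))
    D = np.diag(rng.standard_normal(k))                              # D small & square
    V = rng.standard_normal((n, k))

    # A = S + U D V^T, applied in O(nnz(S) + n*k) per matvec; A is never formed densely
    A = LinearOperator((n, n), matvec=lambda x: S @ x + U @ (D @ (V.T @ x)))
    y = A.matvec(rng.standard_normal(n))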

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                      101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
(Figure: absolute error for random vectors, same magnitude but opposite signs; relative error for orthogonal vectors, sign not reproducible; easy to reproduce, see the sketch below)
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value
103
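The underlying effect needs no MKL to observe: floating-point addition is not associative, so summation order alone can change the bits (illustrative sketch):

    import numpy as np

    x = np.random.default_rng(0).standard_normal(10**6)
    s1 = float(np.sum(x))          # one summation order
    s2 = float(np.sum(x[::-1]))    # another
    print(s1 == s2, abs(s1 - s2))  # typically False: rounding depends on order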

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)
104

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M (a toy sketch of the prerounding idea follows)
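A toy, single-sweep version of the prerounding idea; the real Nguyen/Demmel algorithm uses a few extraction sweeps and handles scaling and overflow carefully, but the core trick is visible here (function name and the 30-bit grid choice are illustrative):

    import numpy as np

    def reproducible_sum(x, bits=30):
        # Round every summand to a common grid so the rounded values are
        # integers times `grid`; integer additions below 2^53 are exact,
        # and an exact sum is the same in every summation order.
        x = np.asarray(x, dtype=np.float64)
        m = np.max(np.abs(x))
        if m == 0.0:
            return 0.0
        grid = 2.0 ** (np.ceil(np.log2(m)) - bits)  # keep ~`bits` bits per summand
        q = np.round(x / grid)                      # integer-valued floats
        return grid * float(np.sum(q))              # exact while n * 2^bits < 2^53

    x = np.random.default_rng(0).standard_normal(10**6)
    print(reproducible_sum(x) == reproducible_sum(x[::-1]))   # True: order-independent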

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                      Summary

Don't Communic…

                                                                                      106

Time to redesign all linear algebra, n-body, … algorithms and software

                                                                                      (and compilers)


Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Figure: heat map over log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x speedup)

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
  – Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T (figure: banded matrix T)
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU
49
• Usual recursive LU:

    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)

  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

• Shape Morphing LU:

    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)

  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting (see the sketch below)
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting
50
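A serial sketch of the tournament, using ordinary column-pivoted QR as the local selector of the best b columns (in the parallel algorithm each match runs at a node of a reduction tree; all names here are illustrative):

    import numpy as np
    from scipy.linalg import qr

    def tournament_pivoting(A, b):
        # Play groups of b columns against each other; each match keeps the b
        # columns that column-pivoted QR ranks first among the 2b candidates.
        groups = [list(range(i, min(i + b, A.shape[1])))
                  for i in range(0, A.shape[1], b)]
        while len(groups) > 1:
            winners = []
            for g1, g2 in zip(groups[0::2], groups[1::2]):
                cols = g1 + g2
                _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
                winners.append([cols[p] for p in piv[:b]])
            if len(groups) % 2 == 1:
                winners.append(groups[-1])   # odd group gets a bye this round
            groups = winners
        return groups[0]                     # candidate leading columns for RRQR

    A = np.random.default_rng(0).standard_normal((100, 32))
    print(tournament_pivoting(A, 4))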

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊙B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm (a runnable transcription follows):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊙ D12
      D21 = D21 ⊙ D11
      D22 = D21 ⊙ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊙ D21
      D12 = D12 ⊙ D22
      D11 = D12 ⊙ D21
52
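A direct numpy transcription of DC-APSP (serial sketch: assumes n is a power of two, zero diagonal, np.inf for missing edges; ⊙ is the (min,+) product implemented as minplus below):

    import numpy as np

    def minplus(X, Y):
        # (min,+) semiring "matmul": Z[i,j] = min_k X[i,k] + Y[k,j]
        return np.min(X[:, :, None] + Y[None, :, :], axis=1)

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12 = D[:m, :m], D[:m, m:]
        D21, D22 = D[m:, :m], D[m:, m:]
        D11 = dc_apsp(D11)
        D12 = np.minimum(D12, minplus(D11, D12))
        D21 = np.minimum(D21, minplus(D21, D11))
        D22 = np.minimum(D22, minplus(D21, D12))
        D22 = dc_apsp(D22)
        D21 = np.minimum(D21, minplus(D22, D21))
        D12 = np.minimum(D12, minplus(D12, D22))
        D11 = np.minimum(D11, minplus(D12, D21))
        return np.block([[D11, D12], [D21, D22]])

    # tiny check against Floyd-Warshall
    rng = np.random.default_rng(0)
    W = rng.uniform(1, 10, (8, 8)); np.fill_diagonal(W, 0.0)
    F = W.copy()
    for k in range(8):
        F = np.minimum(F, F[:, [k]] + F[[k], :])
    assert np.allclose(dc_apsp(W.copy()), F)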

Performance of 2.5D APSP using Kleene

                                                                                        53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
(Figure annotations: 6.2x speedup; 2x speedup)

What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
54

What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      #Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost
55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – Columns of Q·U are eigenvectors, Λ holds eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
(Figure: sequence of diagrams chasing the bulges created by annihilating d diagonals, c columns at a time, with orthogonal transforms Q1, Q1ᵀ, Q2, Q2ᵀ, …)
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

                                                                                        d+c

                                                                                        c

                                                                                        c

                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                        Successive Band Reduction (BischofLangSun)

                                                                                        1

                                                                                        1

                                                                                        2

                                                                                        2

                                                                                        3

                                                                                        3

                                                                                        4

                                                                                        4

                                                                                        Q1

                                                                                        Q1T

                                                                                        Q2

                                                                                        Q2T

                                                                                        Q3

                                                                                        Q3T

                                                                                        b+1

                                                                                        b+1

                                                                                        d+1

                                                                                        d+1

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        c

                                                                                        c

                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                        Successive Band Reduction (BischofLangSun)

                                                                                        1

                                                                                        1

                                                                                        2

                                                                                        2

                                                                                        3

                                                                                        3

                                                                                        4

                                                                                        4

                                                                                        5

                                                                                        5

                                                                                        Q1

                                                                                        Q1T

                                                                                        Q2

                                                                                        Q2T

                                                                                        Q3

                                                                                        Q3T

                                                                                        Q4

                                                                                        Q4T

                                                                                        b+1

                                                                                        b+1

                                                                                        d+1

                                                                                        d+1

                                                                                        c

                                                                                        c

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                        Successive Band Reduction (BischofLangSun)

                                                                                        1

                                                                                        1

                                                                                        2

                                                                                        2

                                                                                        3

                                                                                        3

                                                                                        4

                                                                                        4

                                                                                        5

                                                                                        5

                                                                                        Q5T

                                                                                        Q1

                                                                                        Q1T

                                                                                        Q2

                                                                                        Q2T

                                                                                        Q3

                                                                                        Q3T

                                                                                        Q5

                                                                                        Q4

                                                                                        Q4T

                                                                                        b+1

                                                                                        b+1

                                                                                        d+1

                                                                                        d+1

                                                                                        c

                                                                                        c

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                        Successive Band Reduction (BischofLangSun)

                                                                                        1

                                                                                        1

                                                                                        2

                                                                                        2

                                                                                        3

                                                                                        3

                                                                                        4

                                                                                        4

                                                                                        5

                                                                                        5

                                                                                        6

                                                                                        6

                                                                                        Q5T

                                                                                        Q1

                                                                                        Q1T

                                                                                        Q2

                                                                                        Q2T

                                                                                        Q3

                                                                                        Q3T

                                                                                        Q5

                                                                                        Q4

                                                                                        Q4T

                                                                                        b+1

                                                                                        b+1

                                                                                        d+1

                                                                                        d+1

                                                                                        c

                                                                                        c

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        d+c

                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                        Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional: touch all data 4 times
Communication-Avoiding: touch all data once

[Videos: conventional vs communication-avoiding band reduction]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QTAQ will be block upper triangular:

        QTAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
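To make one divide step concrete, here is a minimal dense NumPy sketch, not the CA algorithm itself: it uses an explicit-inverse Newton iteration for the matrix sign function and SciPy's pivoted QR where the CA version uses inverse-free iterations and the randomized rank-revealing QR named above. The shift, tolerance, and rank threshold are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import qr

def matrix_sign(A, iters=100, tol=1e-12):
    # Newton iteration X <- (X + X^{-1})/2; converges when no eigenvalue
    # of A lies on the splitting line (here, the imaginary axis).
    X = A.copy()
    for _ in range(iters):
        X_new = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(X_new - X, 1) <= tol * np.linalg.norm(X, 1):
            return X_new
        X = X_new
    return X

def split_spectrum(A, shift=0.0):
    # Invariant subspace for eigenvalues with Re(lambda) > shift.
    n = A.shape[0]
    S = matrix_sign(A - shift * np.eye(n))
    P = 0.5 * (S + np.eye(n))            # spectral projector, rank k
    Q, R, _ = qr(P, pivoting=True)       # stand-in for randomized RRQR
    k = int(np.sum(np.abs(np.diag(R)) > 1e-10 * np.abs(R[0, 0])))
    T = Q.T @ A @ Q                      # block upper triangular: T[k:, :k] ~ eps
    return Q, T, k                       # recurse on T[:k, :k] and T[k:, k:]
```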

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

                    | Two Levels: #Words; #Messages                                  | Memory Hierarchy: #Words; #Messages
BLAS-3              | [FLPR'99][BDLST'13][MKL etc.]                                  | [FLPR'99][BDLST'13][MKL etc.]
Cholesky            | [G'97][AP'00][LAPACK][BDHS'09]; [G'97][AP'00][BDHS'09]         | [G'97][AP'00][BDHS'09]
Sym Indefinite      | [BBDDDPSTY'13]; [BBDDDPSTY'13]                                 |
LU                  | [G'97][T'97][GDX'11][BDLST'13]; [GDX'11][BDLST'13]             | [G'97][T'97][BDLST'13]; [BDLST'13]
QR                  | [EG'98][FW'03][DGHL'12][BDLST'13]; [FW'03][DGHL'12][BDLST'13]  | [EG'98][FW'03][BDLST'13]; [FW'03][BDLST'13]
Rank-Revealing QR   | [BDD'11][DGGX'13]                                              |
Sym Eig & SVD       | [BDD'11][BDK'13]; [BDD'11]                                     |
Non-Sym Eig         | [BDD'11]; [BDD'11]                                             |

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

                    | #Words (BW); #Messages (L)                               | Saving factor
BLAS-3              | [AGZ'94][MT'99][ScaLAPACK]; [C'69][vGW'97][SD'11]        | L: n/P^(1/2)
Cholesky            | [ScaLAPACK][T'99][SD'11]                                 | L: n/P^(1/2)
Sym Indefinite      | [BBDDDPSTY'13][ScaLAPACK]; [BBDDDPSTY'13]                | L: n/P^(1/2)
LU                  | [ScaLAPACK][GDX'11][T'99][SD'11]; [GDX'11][T'99][SD'11]  | L: n/P^(1/2)
QR                  | [ScaLAPACK][DGHL'12][T'99]; [DGHL'12][T'99]              | L: n/P^(1/2)
Rank-Revealing QR   | [BDD'11][DGGX'13]                                        |
Sym Eig & SVD       | [BDD'11][BDK'13][ScaLAPACK]; [BDD'11][BDK'13]            | L: n/P^(1/2)
Non-Sym Eig         | [BDD'11]; [BDD'11]                                       | BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Ω(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured); the serial idea is sketched below
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
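A minimal serial sketch of the matrix powers kernel idea, for the 3-point stencil A = tridiag(-1, 2, -1) as the simplest "well-partitioned" matrix: each block reads its piece of x plus a ghost zone of width k once, then computes all k products locally. The duplicated ghost-zone work is the redundant computation mentioned above; block size and matrix are assumptions for illustration.

```python
import numpy as np

def block_powers(x, k, lo, hi):
    """Entries lo..hi-1 of x, A x, ..., A^k x for A = tridiag(-1, 2, -1),
    reading x only once, with a ghost zone of width k. After j local steps,
    entries within distance j of an interior ghost edge are stale, but the
    target range [lo, hi) stays correct as long as j <= k."""
    n = len(x)
    glo, ghi = max(lo - k, 0), min(hi + k, n)   # ghost zone bounds
    v = x[glo:ghi].copy()
    out = [v[lo - glo:hi - glo].copy()]
    for j in range(1, k + 1):
        w = 2.0 * v
        w[1:] -= v[:-1]      # subdiagonal contribution
        w[:-1] -= v[1:]      # superdiagonal contribution
        v = w
        out.append(v[lo - glo:hi - glo].copy())
    return out               # out[j][i] == (A^j x)[lo + i]

# Check against k explicit SpMVs:
n, k = 100, 4
A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
x = np.random.rand(n)
blocks = [block_powers(x, k, lo, lo + 25) for lo in range(0, n, 25)]
y = x.copy()
for j in range(k + 1):
    assert np.allclose(np.concatenate([b[j] for b in blocks]), y)
    y = A @ y
```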

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: spy plot of the matrix, then zoomed view of a submatrix]

• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Figure: Mflops achieved by every r x c register-block size. Reference (unblocked CSR): 190 Mflops; best (4x2 blocking): 1190 Mflops]

Register Profile: Itanium 2

[Figure: the same register profile, spanning 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile panels, best fraction of machine peak in parentheses –
 Power3 (17%): 122 to 252 Mflops; Power4 (16%): 459 to 820 Mflops;
 Itanium 1 (8%): 107 to 247 Mflops; Itanium 2 (33%): 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

[Figure: zoomed spy plot of the same matrix]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup! (blocked kernel sketched below)
  – Actual mflop rate is 1.5² = 2.25x higher
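A sketch of register-blocked (BCSR) SpMV showing where the win comes from: one column index per block rather than per nonzero, and a dense r x c multiply that a compiler can fully unroll. Explicitly stored zeros (the "fill") cost flops but cut index overhead. The layout and names are illustrative; a tuned kernel would be unrolled C, not Python.

```python
import numpy as np

def bcsr_spmv(rptr, cind, vals, x, r, c):
    """y = A x with A stored in r x c Block CSR format.
    rptr/cind index blocks; vals[k] is the dense r x c block k."""
    nb_rows = len(rptr) - 1
    y = np.zeros(nb_rows * r)
    for I in range(nb_rows):                 # block row I = rows I*r .. I*r+r-1
        acc = np.zeros(r)
        for kk in range(rptr[I], rptr[I + 1]):
            J = cind[kk]                     # block col J = cols J*c .. J*c+c-1
            acc += vals[kk] @ x[J * c:(J + 1) * c]
        y[I * r:(I + 1) * r] = acc
    return y

# Toy 6x6 matrix: two 3x3 blocks in the first block row, one in the second.
r = c = 3
vals = np.array([np.eye(3), 2 * np.eye(3), np.ones((3, 3))])
rptr, cind = [0, 2, 3], [0, 1, 0]
x = np.arange(6.0)
print(bcsr_spmv(rptr, cind, vals, x, r, c))   # [6. 9. 12. 3. 3. 3.]
```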

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the full matrix]

100x100 Submatrix Along Diagonal

[Figure: zoomed spy plot]

Post-RCM Reordering

[Figure: spy plot after Reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & AT·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm: classical CG. SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Algorithm: CA-CG. The s SpMVs are replaced by one call to the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication]
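For reference, a plain-Python sketch of classical CG with the per-iteration communication marked in comments: one SpMV plus two global reductions, which is exactly what CA-CG reorganizes into one communication phase per s steps. The function and variable names (A_mult, etc.) and the 1D Poisson test matrix are illustrative.

```python
import numpy as np

def cg(A_mult, b, x, iters):
    """Classical CG: per iteration, 1 SpMV (neighbor communication)
    and 2 inner products (global reductions)."""
    r = b - A_mult(x)
    p = r.copy()
    rr = r @ r                      # global reduction
    for _ in range(iters):
        Ap = A_mult(p)              # SpMV: exchange boundary entries
        alpha = rr / (p @ Ap)       # global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r              # global reduction
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Example on the 1D Poisson matrix:
n = 50
A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
b = np.ones(n)
x = cg(lambda v: A @ v, b, np.zeros(n), 200)
print(np.linalg.norm(A @ x - b))    # ~ machine precision
```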

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG (monomial basis). CA-CG converges more slowly due to roundoff and loses accuracy, relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400
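The breakdown is easy to reproduce: the condition number of the (column-normalized) monomial Krylov basis for the model problem above blows up toward 1/ε as s grows. The random starting vector and seed are arbitrary choices for illustration.

```python
import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (the model problem above)
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

rng = np.random.default_rng(0)
x = rng.random(m * m)
for s in (4, 8, 16):
    V = np.empty((m * m, s + 1))
    V[:, 0] = x / np.linalg.norm(x)
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
        V[:, j + 1] /= np.linalg.norm(V[:, j + 1])  # scaling alone can't fix it
    print(s, np.linalg.cond(V))   # grows toward 1/eps by s = 16
```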

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):  Graph Laplacian             Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVT, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Ak = (S + UDVT)k as sum of Sk and low-rank matrices (see the sketch below)
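A short sketch of why the split form matters: A = S + UDVT (and its powers) can be applied without ever forming the dense sum, using one sparse SpMV plus two skinny dense multiplies. S_mult stands for any sparse matvec; the names are illustrative.

```python
import numpy as np

def apply_s_plus_lowrank(S_mult, U, D, V, x):
    # y = (S + U D V^T) x: one SpMV + skinny dense work, never a dense n x n sum
    return S_mult(x) + U @ (D @ (V.T @ x))

def apply_power2(S_mult, U, D, V, x):
    # (S + U D V^T)^2 x by applying the split form twice; expanding it
    # symbolically gives S^2 plus low-rank terms, which is what lets a
    # matrix powers kernel for A^k stay cheap.
    y = apply_s_plus_lowrank(S_mult, U, D, V, x)
    return apply_s_plus_lowrank(S_mult, U, D, V, y)
```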

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?!"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figures: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign is not reproducible]

• Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get exact answer (fails 2)
  – Prerounding technique (Nguyen, D.) – sketched below
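A minimal sketch of the prerounding idea, assuming finite inputs, max|x| > 0, and n up to roughly 10⁶; the production algorithm (Nguyen, D.) derives bin boundaries from the exponent range, handles overflow and NaN/Inf, and reduces across processors with the same fixed bins.

```python
import numpy as np

def reproducible_sum(x, bins=3, bin_bits=30):
    # Round every summand to a shared boundary M so the extracted high-order
    # parts are multiples of one quantum and add with NO rounding error --
    # hence the same bits for any summand order or thread count.
    x = np.array(x, dtype=np.float64)
    n = len(x)
    # Boundary with power-of-2 slack so n extracted parts sum exactly
    # (assumes max|x| > 0; with 30-bit bins, roughly n <= 2^20).
    M = 2.0 ** (np.ceil(np.log2(np.max(np.abs(x)))) + 1) * n
    total = 0.0
    for _ in range(bins):
        high = (x + M) - M        # x rounded to the grid of spacing ~ulp(M)
        total += high.sum()       # exact sum, hence order-independent
        x = x - high              # remainder falls through to a finer bin
        M *= 2.0 ** -bin_bits
    return total                  # bits below the last bin are discarded

v = np.random.default_rng(1).standard_normal(10**6)
print(reproducible_sum(v) == reproducible_sum(v[::-1]))   # True
print(v.sum() == v[::-1].sum())                           # may be False
```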

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                        • Summary

2.5D vs 2D LU: With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down the column, along the row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU

        [Figure: band structure of T]

      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
    • (A sketch of the conventional LDL^T baseline follows below.)

48
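For contrast, here is the conventional symmetric-indefinite factorization the slide starts from, via SciPy's Bunch-Kaufman-based ldl (a minimal sketch, assuming SciPy >= 1.0; the CA-Aasen variant with banded T and TSLU is not in SciPy):

    import numpy as np
    from scipy.linalg import ldl

    rng = np.random.default_rng(0)
    B = rng.standard_normal((6, 6))
    A = B + B.T                        # symmetric, generally indefinite

    # P*A*P^T = L*D*L^T with D block diagonal (1x1 and 2x2 pivot blocks);
    # SciPy returns L with the permutation already applied to its rows.
    L, D, perm = ldl(A, lower=True)
    assert np.allclose(L @ D @ L.T, A)
    # Inertia is preserved: D has the same counts of positive and
    # negative eigenvalues as A.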

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

Conventional recursive GEPP (columnwise layout throughout):

    func factor(A):
        if A has 1 column: update it
        else:
            factor(left half of A)
            update right half of A
            factor(right half of A)

  • #Words = O(n^3/M^(1/2))
  • #Messages = O(n^3/M)

SMLU (morphs between layouts):

    func factor(A):
        if A has 1 column: update it
        else:
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)

  • #Words = O(n^3/M^(1/2))
  • #Messages = O(n^3/M^(3/2))

49
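A runnable sketch of the recursive GEPP skeleton above (numpy assumed; rec_lu, c0, w, piv are illustrative names). The comments mark where SMLU would morph the layout; the reshape steps themselves, which are the point of SMLU, are elided here:

    import numpy as np

    def rec_lu(A, c0, w, piv):
        # Recursive GEPP on columns c0..c0+w of square A, in place.
        # Pivot swaps apply to full rows, as partial pivoting requires.
        if w == 1:
            k = c0 + int(np.argmax(np.abs(A[c0:, c0])))
            A[[c0, k], :] = A[[k, c0], :]
            piv[c0], piv[k] = piv[k], piv[c0]
            A[c0+1:, c0] /= A[c0, c0]
            return
        h = w // 2
        rec_lu(A, c0, h, piv)                          # factor left half
        # (SMLU: reshape to recursive block format here)
        L11 = np.tril(A[c0:c0+h, c0:c0+h], -1) + np.eye(h)
        A[c0:c0+h, c0+h:c0+w] = np.linalg.solve(L11, A[c0:c0+h, c0+h:c0+w])
        A[c0+h:, c0+h:c0+w] -= A[c0+h:, c0:c0+h] @ A[c0:c0+h, c0+h:c0+w]
        # (SMLU: reshape back to columnwise format here)
        rec_lu(A, c0+h, w - h, piv)                    # factor right half

    n = 8
    A0 = np.random.rand(n, n)
    A, piv = A0.copy(), list(range(n))
    rec_lu(A, 0, n, piv)
    L, U = np.tril(A, -1) + np.eye(n), np.triu(A)
    assert np.allclose(L @ U, A0[piv])                 # P*A = L*U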

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, e.g. in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
  • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
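A hedged sketch of the tournament (SciPy assumed; tournament_round and tournament_pivot are illustrative names, not the paper's code). LAPACK's column-pivoted QR, reached via scipy.linalg.qr(..., pivoting=True), stands in for the local selector; the slide notes a stronger local RRQR (Gu/Eisenstat) can be used instead:

    import numpy as np
    from scipy.linalg import qr

    def tournament_round(A, idx_left, idx_right, b):
        # from 2b candidate columns, keep the b "best" by local pivoted QR
        cand = np.concatenate([idx_left, idx_right])
        _, _, p = qr(A[:, cand], mode='economic', pivoting=True)
        return cand[p[:b]]

    def tournament_pivot(A, b):
        # reduction tree over groups of b columns; each round halves #groups
        groups = [np.arange(j, min(j + b, A.shape[1]))
                  for j in range(0, A.shape[1], b)]
        while len(groups) > 1:
            groups = [tournament_round(A, groups[i], groups[i + 1], b)
                      if i + 1 < len(groups) else groups[i]
                      for i in range(0, len(groups), 2)]
        return groups[0]    # indices of the b columns to pivot to the front

    A = np.random.rand(100, 64)
    print(tournament_pivot(A, 8))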

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths (APSP) using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
        for i = 1:n
            for j = 1:n
                D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21

52
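A runnable sketch of the recursion (numpy assumed; dc_apsp and minplus are illustrative names, with the ⊗ update folding the min into its target as on the slide). Missing edges are np.inf; the 2.5D algorithm distributes exactly these block ⊗ products:

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the semiring "matmul"
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0.0)
        h, D = n // 2, A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    n = 8
    W = np.random.rand(n, n)
    np.fill_diagonal(W, 0.0)
    D_ref = W.copy()
    for k in range(n):                  # reference Floyd-Warshall
        D_ref = np.minimum(D_ref, D_ref[:, [k]] + D_ref[[k], :])
    assert np.allclose(dc_apsp(W), D_ref)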

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for a 2D grid, a 3D grid, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):

    Words_moved = Ω(min(d·n/P^(1/2), d^2·n/P))

  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55
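A small experiment (SciPy assumed) behind the "reuse is unlikely" step: with Prob(nonzero) = d/n, the number of scalar products and nnz(C) both come out near d^2·n, so almost every entry of C is formed from a single product:

    import scipy.sparse as sp

    n, d = 10_000, 4
    A = sp.random(n, n, density=d/n, format='csr', random_state=0)
    B = sp.random(n, n, density=d/n, format='csr', random_state=1)
    C = A @ B
    print(A.nnz, B.nnz)     # each ~ d*n = 40,000
    print(C.nnz, d*d*n)     # nnz(C) ~ d^2*n = 160,000: little reuse per entry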

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – The columns of Q·U are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence, Slides 58-68: a band of width b+1 is reduced by bulge chasing. Q1 eliminates d diagonals from c columns, creating a (d+c) x (d+c) bulge; applying Q1^T, then Q2, Q2^T, ..., Q5, Q5^T chases successive bulges (steps 1-6) down the band. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

Conventional vs CA - SBR

  Conventional:            touch all data 4 times
  Communication-Avoiding:  touch all data once


Speedups of Sym Band Reduction vs DSBTRD

• Up to 1.7x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 1.2x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 2.5x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 3.0x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
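A dense illustration of one split (numpy/SciPy assumed; split, sigma, iters are illustrative names). Here the spectral projector comes from a matrix-sign Newton iteration and the subspace basis from column-pivoted QR; the CA algorithm instead uses a QR-only implicit repeated-squaring iteration and randomized RRQR, which this sketch elides:

    import numpy as np
    from scipy.linalg import qr

    def split(A, sigma=0.0, iters=50):
        # Split the spectrum of A across the line Re(z) = sigma
        n = A.shape[0]
        X = A - sigma * np.eye(n)
        for _ in range(iters):                  # Newton iteration for sign(X);
            X = 0.5 * (X + np.linalg.inv(X))    # assumes no eigenvalue on the line
        P = 0.5 * (np.eye(n) - X)       # projector onto the Re(λ) < sigma subspace
        k = int(round(np.trace(P)))     # its dimension
        Q, _, _ = qr(P, pivoting=True)  # leading k columns of Q span range(P)
        B = Q.T @ A @ Q                 # block upper triangular, up to ε
        return Q, B, k                  # recurse on B[:k, :k] and B[k:, k:]

    A = np.random.rand(8, 8)
    Q, B, k = split(A, sigma=np.trace(A) / 8)
    print(k, np.linalg.norm(B[k:, :k]))         # the ε-small (2,1) block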

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: #Words and #Messages, for two levels of memory and for a full memory hierarchy)

• BLAS-3: all four bounds attained by [FLPR'99], [BDLST'13], [MKL etc.]
• Cholesky: #Words (two levels): [G'97], [AP'00], [LAPACK], [BDHS'09]; remaining bounds: [G'97], [AP'00], [BDHS'09]
• Sym Indefinite: #Words and #Messages (two levels): [BBDDDPSTY'13]
• LU: #Words (two levels): [G'97], [T'97], [GDX'11], [BDLST'13]; #Messages (two levels): [GDX'11], [BDLST'13]; #Words (hierarchy): [G'97], [T'97], [BDLST'13]; #Messages (hierarchy): [BDLST'13]
• QR: #Words (two levels): [EG'98], [FW'03], [DGHL'12], [BDLST'13]; #Messages (two levels): [FW'03], [DGHL'12], [BDLST'13]; #Words (hierarchy): [EG'98], [FW'03], [BDLST'13]; #Messages (hierarchy): [FW'03], [BDLST'13]
• Rank Revealing QR: [BDD'11], [DGGX'13]
• Sym Eig & SVD: #Words: [BDD'11], [BDK'13]; #Messages: [BDD'11]
• Non-Sym Eig: #Words and #Messages: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: #Words (BW), #Messages (L), saving factor)

• BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11]; saving factor: L: n/P^(1/2)
• Cholesky: [ScaLAPACK], [T'99], [SD'11]; saving factor: L: n/P^(1/2)
• Sym Indefinite: #Words: [BBDDDPSTY'13], [ScaLAPACK]; #Messages: [BBDDDPSTY'13]; saving factor: L: n/P^(1/2)
• LU: #Words: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; #Messages: [GDX'11], [T'99], [SD'11]; saving factor: L: n/P^(1/2)
• QR: #Words: [ScaLAPACK], [DGHL'12], [T'99]; #Messages: [DGHL'12], [T'99]; saving factor: L: n/P^(1/2)
• Rank Revealing QR: [BDD'11], [DGGX'13]
• Sym Eig & SVD: #Words: [BDD'11], [BDK'13], [ScaLAPACK]; #Messages: [BDD'11], [BDK'13]; saving factor: L: n/P^(1/2)
• Non-Sym Eig: #Words and #Messages: [BDD'11]; saving factor: BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Θ(c·n^2/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data: optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages: optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
  – (A toy sketch of the matrix-powers idea follows below.)

75
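A toy demonstration (numpy assumed) of the matrix-powers idea for a 1D Laplacian: a processor owning a contiguous block of x can take k local SpMV steps with no further communication if it first receives k "ghost" entries from each neighbor, one message instead of k. Here spmv, step_interior, lo, hi are illustrative names:

    import numpy as np

    def spmv(x):
        # global y = A x for A = tridiag(-1, 2, -1) (1D Poisson, Dirichlet)
        y = 2.0 * x
        y[:-1] -= x[1:]
        y[1:] -= x[:-1]
        return y

    def step_interior(w):
        # same stencil where the full window is available; the valid
        # window shrinks by one point on each side per application
        return 2.0 * w[1:-1] - w[2:] - w[:-2]

    n, k, lo, hi = 32, 3, 8, 16
    x = np.random.rand(n)

    # Conventional: k SpMVs, each needing neighbor communication at lo and hi
    xg = x.copy()
    for _ in range(k):
        xg = spmv(xg)

    # CA matrix powers: fetch k ghost layers once, then compute locally
    w = x[lo - k : hi + k].copy()       # one message per neighbor
    for _ in range(k):
        w = step_interior(w)            # redundant flops on the shrinking halo

    assert np.allclose(w, xg[lo:hi])    # local block of A^k x, no more messages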

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile; Reference vs Best (4x2) block size, in Mflop/s]

79

Register Profile: Itanium 2

[Figure: Mflop/s for each register block size, ranging from 190 to 1190 Mflop/s]

80

Register Profiles: IBM and Intel IA-64

[Figure, four register-profile panels, Mflop/s per block size:
  Power3 - 17: 122 to 252 Mflop/s; Power4 - 16: 459 to 820 Mflop/s;
  Itanium 1 - 8: 107 to 247 Mflop/s; Itanium 2 - 33: 190 Mflop/s to 1.2 Gflop/s]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but...

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual mflop rate is 1.5^2 = 2.25x higher
• (A block-storage sketch follows below.)

85
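A small sketch of the register-blocking trade-off (SciPy assumed): store the matrix in Block CSR at block size r x c, padding with explicit zeros, and check the fill ratio = stored entries / true nonzeros. The best (r, c) is machine- and matrix-dependent, which is why search is needed:

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(3000, 3000, density=1e-3, format='csr', random_state=0)
    x = np.random.rand(3000)

    for r, c in [(1, 1), (2, 2), (3, 3), (4, 2)]:
        B = sp.bsr_matrix(A, blocksize=(r, c))  # pads blocks with explicit zeros
        fill = B.nnz / A.nnz                    # stored entries / true nonzeros
        y = B @ x                               # blocked SpMV (r x c block kernels)
        print((r, c), round(fill, 2))
    # A matrix with natural dense substructure (like raefsky's 8x8 blocks) has
    # fill near 1 at the right block size; this random matrix pays ~r*c instead.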

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, ...

89
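A hedged sketch of the first reordering step above (SciPy assumed): Reverse Cuthill-McKee pulls nonzeros toward the diagonal, creating the locally dense structure that blocking can exploit. The TSP-based ordering from the slide is not in SciPy and is elided; bandwidth is an illustrative helper:

    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    A = sp.random(2000, 2000, density=2e-3, format='csr', random_state=1)
    A = (A + A.T).tocsr()                   # symmetrize for a graph ordering

    p = reverse_cuthill_mckee(A, symmetric_mode=True)
    B = A[p, :][:, p]                       # symmetrically permuted matrix

    def bandwidth(M):
        coo = M.tocoo()
        return int(abs(coo.row - coo.col).max())

    print(bandwidth(A), '->', bandwidth(B)) # bandwidth drops sharply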

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later ...

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski

91

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                          93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing] SpMVs and dot products require communication in each iteration.

94

Example: CA-Conjugate Gradient

[Algorithm listing] SpMVs are done via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; local computations within the inner loop require no communication.
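A minimal sketch, in Python, of the two ingredients the annotations name: a monomial-basis matrix powers kernel and a single Gram-matrix reduction. The function names and the monomial basis are illustrative assumptions, not the exact CA-CG code.

import numpy as np

def matrix_powers(A, v, s):
    # Monomial basis V = [v, A v, ..., A^s v].  In parallel CA-CG the
    # kernel computes its local rows of V after ONE exchange of ghost
    # entries, replacing s separate communicating SpMVs.
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def gram_matrix(V):
    # G = V^T V: one global reduction supplies all the inner products
    # the next s iterations need.
    return V.T @ V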

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                          96

CA-CG (monomial basis) vs CG

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400

[Convergence plot: CA-CG (monomial) shows slower convergence due to roundoff and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; "machine precision" marked on the error axis]

97
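The breakdown is easy to reproduce numerically. A small check in Python (assuming the slide's 30x30 Poisson grid; the random starting vector is an illustrative choice):

import numpy as np
import scipy.sparse as sp

n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()  # 2D Poisson, 5-pt stencil

x = np.random.default_rng(0).standard_normal(n * n)
V = [x / np.linalg.norm(x)]
for _ in range(16):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))    # normalized monomial basis vectors
V = np.stack(V, axis=1)
# Typically rank < 17 and cond(V) near 1/eps: the monomial basis has
# lost linear independence in floating point by s = 16.
print(np.linalg.matrix_rank(V), np.linalg.cond(V))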

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices
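A sketch of why the S + UDV^T representation pays off, in Python (sizes and names are illustrative): the matrix is never formed, and each application costs O(nnz(S) + n·r) instead of O(n^2).

import numpy as np
import scipy.sparse as sp

n, r = 1000, 5
S = sp.random(n, n, density=0.01, format="csr", random_state=0)  # sparse part
U = np.random.randn(n, r)
V = np.random.randn(n, r)
D = np.diag(np.random.randn(r))                                  # r x r, r << n

def apply_A(x):
    # y = (S + U D V^T) x without forming A
    return S @ x + U @ (D @ (V.T @ x))

# A^2 x = S(Sx) + S U (D V^T x) + U D V^T (Sx) + U D (V^T U) D (V^T x):
# every term is again sparse-times-vector plus low-rank corrections,
# the structure the slide wants to keep for A^k.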

                              Indices
                              Explicit (O(nnz))      Implicit (o(nnz))
  Nonzero   Explicit          CSR and variations     Vision, climate, AMR, …
  entries   (O(nnz))
            Implicit          Graph Laplacian        Stencils
            (o(nnz))

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                          101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

103
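The underlying effect is ordinary nonassociativity of floating-point addition; a stand-in for the threaded MKL experiment in Python (sizes illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)

s1 = np.sum(x)                                    # one association order
s2 = np.sum(x.reshape(1000, 1000).sum(axis=0))    # "per-thread" partial sums
print(s1 == s2, s1 - s2)  # typically False: the two orders differ in the last bits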

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

104
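A toy illustration of the prerounding idea, in Python. This is a drastic simplification of the actual Nguyen/Demmel algorithm (which uses several bins and handles scaling and overflow); it assumes len(x) < 2^24 and x not identically zero.

import numpy as np

def prerounded_sum(x):
    # Round every summand to a common bit boundary chosen from max|x_i|;
    # the rounded parts are all multiples of one ulp, so their sum is
    # EXACT in any order -> bitwise reproducible.  Accuracy can be
    # improved by recursing on the remainders lo.
    m = np.max(np.abs(x))
    M = 2.0 ** (np.floor(np.log2(m)) + 25)   # extraction constant
    hi = (x + M) - M        # x_i rounded to a multiple of ulp(M)
    lo = x - hi             # exact remainders, |lo_i| <= ulp(M)/2
    return np.sum(hi)       # exact and order-independent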

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: PAP^T = LDL^T, D "simple"
    • Save ½ the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAP^T = LTL^T where T is banded, using TSLU

48


      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
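For reference, the symmetric structure being exploited, as a toy Python LDL^T with diagonal D and no pivoting (illustration only; a stable code must pivot, via Bunch-Kaufman's 1x1/2x2 blocks or Aasen's tridiagonal T):

import numpy as np

def ldlt_nopivot(A):
    # A = L D L^T with unit lower triangular L and diagonal D.
    # Symmetry is exploited: only the lower triangle is referenced,
    # ~n^3/3 flops instead of LU's ~2n^3/3.
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        d[k] = A[k, k] - L[k, :k] @ (d[:k] * L[k, :k])
        for i in range(k + 1, n):
            L[i, k] = (A[i, k] - L[i, :k] @ (d[:k] * L[k, :k])) / d[k]
    return L, d   # check: np.allclose(L @ np.diag(d) @ L.T, A)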

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• func factor(A):
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)

  – #Words = O(n^3/M^(1/2))
  – #Messages = O(n^3/M)

• func factor(A):
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)

  – #Words = O(n^3/M^(1/2))
  – #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of AP = QR span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (a sketch follows this slide):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
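A sketch of the column-selection tournament in Python, using ordinary QR with column pivoting as the per-group selector (one of the options named above; scipy's qr with pivoting=True plays that role here):

import numpy as np
from scipy.linalg import qr

def tournament_select(A, b):
    # Reduction tree over groups of b columns: each round merges two
    # groups and keeps the b "best" columns chosen by QR with column
    # pivoting.  Returns candidate leading-column indices for RRQR.
    cols = [np.arange(i, min(i + b, A.shape[1]))
            for i in range(0, A.shape[1], b)]
    while len(cols) > 1:
        nxt = []
        for i in range(0, len(cols) - 1, 2):
            merged = np.concatenate([cols[i], cols[i + 1]])
            _, _, piv = qr(A[:, merged], pivoting=True)
            nxt.append(merged[piv[:b]])
        if len(cols) % 2:
            nxt.append(cols[-1])            # odd group advances unchanged
        cols = nxt
    return cols[0]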

                                                                                            Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                            ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                            ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                            bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                            bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A ⊗ B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

52

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
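A runnable toy version of DC-APSP in Python on a dense distance matrix (the broadcasted min-plus product uses O(n^3) temporary memory, so this is for small n only; "D = A ⊗ B" follows the slide's definition, including the min with the current D):

import numpy as np

def minplus(D, A, B):
    # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    # D: dense distance matrix, zero diagonal, np.inf for missing edges
    n = D.shape[0]
    if n == 1:
        return D
    h = n // 2
    D11, D12 = D[:h, :h], D[:h, h:]
    D21, D22 = D[h:, :h], D[h:, h:]
    D11 = dc_apsp(D11)
    D12 = minplus(D12, D11, D12)
    D21 = minplus(D21, D21, D11)
    D22 = minplus(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = minplus(D21, D22, D21)
    D12 = minplus(D12, D12, D22)
    D11 = minplus(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])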

Performance of 2.5D APSP using Kleene

53

[Strong scaling plot on Hopper (Cray XE6 with 1024 nodes = 24,576 cores), annotated with a 6.2x speedup and a 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – QU's columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q A Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
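For contrast, the conventional Dense → Tridiagonal → Diagonal pipeline spelled out with standard scipy calls (an illustration, not the CA algorithm; for symmetric input the Hessenberg form is tridiagonal):

import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 300))
A = (A + A.T) / 2                       # symmetric test matrix

T, Q = hessenberg(A, calc_q=True)       # Dense -> Tridiagonal, A = Q T Q^T
w, U = eigh_tridiagonal(np.diag(T), np.diag(T, 1))  # Tridiagonal -> Diagonal
# Eigenvalues w, eigenvectors Q @ U.  The reduction step above is the
# half-BLAS2 bottleneck (cf. sytrd) that the CA approach replaces by
# Dense -> Banded (via TSQR) -> Tridiagonal.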

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth
c = #columns
d = #diagonals
Constraint: c + d ≤ b

[Animation: starting from a band of width b+1, sweeps numbered 1-6 each eliminate c columns of d diagonals and chase the resulting (d+c)-sized bulges down the band, applying orthogonal transforms Q1, Q1^T, …, Q5, Q5^T]

Conventional vs CA-SBR
• Conventional: touch all data 4 times
• Communication-avoiding: touch all data once

                                                                                            Speedups of Sym Band Reductionvs DSBTRD

                                                                                            bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

                                                                                            bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

                                                                                            bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

                                                                                            bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

                                                                                            bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer (a sketch follows below)
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
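To make the divide step concrete, here is a minimal serial numpy/scipy sketch. It uses the classical matrix-sign-function route to an invariant subspace rather than the randomized implicit-repeated-squaring construction of [BDD'11]; sign_newton, split_spectrum, and shift are illustrative names, and A is assumed real with no eigenvalues on the splitting line.

  import numpy as np
  from scipy.linalg import qr

  def sign_newton(A, iters=100, tol=1e-13):
      # Matrix sign function via the Newton iteration X <- (X + X^-1)/2;
      # converges when A has no purely imaginary eigenvalues.
      X = A.copy()
      for _ in range(iters):
          Xn = 0.5 * (X + np.linalg.inv(X))
          if np.linalg.norm(Xn - X, 1) <= tol * np.linalg.norm(Xn, 1):
              return Xn
          X = Xn
      return X

  def split_spectrum(A, shift=0.0):
      # One divide step: separate eigenvalues with Re(lambda) > shift from the rest.
      n = A.shape[0]
      S = sign_newton(A - shift * np.eye(n))
      P = 0.5 * (S + np.eye(n))                 # approximate spectral projector
      r = int(round(0.5 * (np.trace(S) + n)))   # #eigenvalues right of the line
      Q, _, _ = qr(P, pivoting=True)            # rank-revealing QR of the projector
      T = Q.T @ A @ Q                           # T[r:, :r] is O(eps): block triangular
      return Q, T, r                            # recurse on T[:r, :r] and T[r:, r:]

Recursing on the two diagonal blocks, with randomized choices of where to split, gives the full divide-and-conquer; the CA version replaces the explicit inverse by QR-based implicit repeated squaring and uses randomized RRQR.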

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]; entries cite algorithms attaining the #words and #messages bounds, for two levels of memory and for a full memory hierarchy.

• BLAS-3: two levels [FLPR'99][BDLST'13][MKL etc.]; hierarchy [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: two levels, words [G'97][AP'00][LAPACK][BDHS'09], messages [G'97][AP'00][BDHS'09]; hierarchy [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] (both)
• LU: two levels, words [G'97][T'97][GDX'11][BDLST'13], messages [GDX'11][BDLST'13]; hierarchy, words [G'97][T'97][BDLST'13], messages [BDLST'13]
• QR: two levels, words [EG'98][FW'03][DGHL'12][BDLST'13], messages [FW'03][DGHL'12][BDLST'13]; hierarchy, words [EG'98][FW'03][BDLST'13], messages [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: words [BDD'11][BDK'13], messages [BDD'11]
• Non-Sym. Eig: words [BDD'11], messages [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym. Indefinite: words [BBDDDPSTY'13][ScaLAPACK], messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: words [ScaLAPACK][GDX'11][T'99][SD'11], messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: words [ScaLAPACK][DGHL'12][T'99], messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: words [BDD'11][BDK'13][ScaLAPACK], messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym. Eig: words [BDD'11], messages [BDD'11]; saving factors BW: P^(1/2), L: n
• Attaining with extra memory: 2.5D, M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation: conventional does O(k) moves of data from slow to fast memory; new does O(1) moves (optimal)
  – Parallel implementation on p processors: conventional does O(k log p) messages (k SpMV calls, dot products); new does O(log p) messages (optimal)
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation (see the matrix-powers sketch below)
  – Challenges: poor partitioning, preconditioning, numerical stability
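The O(1)-moves / O(log p)-messages counts come from the matrix-powers kernel. Below is a toy numpy sketch for a 1D periodic 3-point stencil (matrix_powers_1d is an illustrative name; real kernels handle general graph partitions): each of the p blocks fetches its k ghost cells once, then computes x, Ax, …, A^k x locally, paying some redundant flops in the overlaps.

  import numpy as np

  def matrix_powers_1d(x, k, p):
      # V[j] = A^j x for j = 0..k, where (A x)_i = x_{i-1} - 2 x_i + x_{i+1}
      # with periodic (mod n) indexing.
      n = len(x)
      V = np.zeros((k + 1, n))
      V[0] = x
      for blk in np.array_split(np.arange(n), p):
          lo, hi = blk[0], blk[-1] + 1
          w = x[np.arange(lo - k, hi + k) % n]      # one ghost-zone fetch
          for j in range(1, k + 1):
              w = w[:-2] - 2.0 * w[1:-1] + w[2:]    # local sweep, no communication
              V[j, lo:hi] = w[k - j : k - j + hi - lo]
      return V

  # check against k explicit SpMVs with the dense operator
  n, k, p = 64, 4, 8
  A = np.roll(np.eye(n), 1, 0) - 2 * np.eye(n) + np.roll(np.eye(n), -1, 0)
  x = np.random.default_rng(1).standard_normal(n)
  V, y = matrix_powers_1d(x, k, p), x.copy()
  for j in range(1, k + 1):
      y = A @ y
      assert np.allclose(V[j], y)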

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200; nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)
• The matrix has an 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search
[Figure: register-blocking profile on Itanium 2; reference implementation: 190 Mflops, best block size (4x2): 1190 Mflops]

Register Profiles: IBM and Intel IA-64
[Figure: four register-profile heat maps, best fraction of peak in parentheses: Power3 (17%; best 252 Mflops vs reference 122 Mflops), Power4 (16%; 820 vs 459 Mflops), Itanium 1 (8%; 247 vs 107 Mflops), Itanium 2 (33%; 1.2 Gflops vs 190 Mflops)]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614; nnz = 1.1M

Zoom in to top corner
[Figure: magnified view of the Ex11 sparsity pattern]

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But it would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III this gave a 1.5x speedup, even though the blocked code does 1.5x more flops: the actual mflop rate is 1.5 x 1.5 = 2.25x higher (see the register-blocking sketch below)
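A minimal sketch of register blocking with explicit zero fill, written in Python for clarity where real kernels are unrolled C; csr_to_bcsr and bcsr_spmv are illustrative names, and the matrix dimensions are assumed divisible by the block sizes (pad otherwise).

  import numpy as np
  from collections import defaultdict

  def csr_to_bcsr(indptr, indices, data, n_rows, r=3, c=3):
      # Round every nonzero into an r x c block on a fixed logical grid;
      # untouched entries of a touched block are stored as explicit zeros (fill).
      blocks = defaultdict(lambda: np.zeros((r, c)))
      for i in range(n_rows):
          for t in range(indptr[i], indptr[i + 1]):
              j = indices[t]
              blocks[(i // r, j // c)][i % r, j % c] = data[t]
      return blocks   # (block_row, block_col) -> dense r x c block

  def bcsr_spmv(blocks, n_rows, x, r=3, c=3):
      # One column index per block instead of one per nonzero, plus a small
      # dense r x c multiply that a real kernel unrolls into registers.
      y = np.zeros(n_rows)
      for (bi, bj), B in blocks.items():
          y[bi * r:(bi + 1) * r] += B @ x[bj * c:(bj + 1) * c]
      return y

Even with fill ratio 1.5 the blocked kernel wins, because each stored block costs one index fetch and a fully unrolled multiply.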

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: sparsity pattern of the cavity matrix]

100x100 Submatrix Along Diagonal
[Figure: zoomed view of the pattern]

Post-RCM Reordering
[Figure: pattern after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering
[Figure: before = green + red, after = green + blue]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver-library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Slide shows the classical CG recurrences; the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient
[Slide shows the s-step reformulation: the s SpMVs are replaced by one call to the CA matrix-powers kernel, the dot products by one global reduction to compute the Gram matrix G, and the local computations within the inner loop require no communication; see the Gram-matrix sketch below]
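A minimal numpy illustration (my own example, not the deck's code) of the Gram-matrix trick behind "one global reduction": once the matrix-powers kernel has produced the basis V, G = VᵀV is formed with a single reduction, and every inner product of vectors represented in that basis becomes a small local computation aᵀGb.

  import numpy as np

  rng = np.random.default_rng(0)
  n, m = 10_000, 8                     # m = O(s) basis vectors of length n
  V = rng.standard_normal((n, m))      # stand-in for [x, Ax, ..., A^s x]
  G = V.T @ V                          # ONE global reduction in parallel

  a, b = rng.standard_normal(m), rng.standard_normal(m)
  # <Va, Vb> without touching length-n vectors again:
  assert np.isclose(a @ G @ b, (V @ a) @ (V @ b))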

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down before reaching machine precision. See the conditioning demo below.]
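A small numpy experiment (my own, not the deck's data) showing why the monomial basis breaks down: the columns A^j x turn toward the dominant eigenvector, so the condition number of [x, Ax, …, A^s x] blows up with s even when each column is scaled.

  import numpy as np

  # Model problem from the slide: 2D Poisson, 5-point stencil, 30x30 grid.
  m = 30
  T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
  A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))   # n = 900, cond(A) ~ 400
  x = np.random.default_rng(0).standard_normal(m * m)

  for s in (4, 8, 16):
      V = np.empty((m * m, s + 1))
      V[:, 0] = x / np.linalg.norm(x)
      for j in range(s):
          w = A @ V[:, j]
          V[:, j + 1] = w / np.linalg.norm(w)         # scale columns only
      print(s, np.linalg.cond(V))  # grows rapidly; near 1/eps the basis is
                                   # numerically rank deficient (s = 16 on the slide)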

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• The matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, with D small & square
• Such semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices
• Taxonomy (see the sketch below):

                                Indices explicit, O(nnz)   Indices implicit, o(nnz)
  Entries explicit, O(nnz):     CSR and variations          Vision, climate, AMR, …
  Entries implicit, o(nnz):     Graph Laplacian             Stencils
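A small scipy sketch of the "sparse + low rank" point: multiply by A = S + U·D·Vᵀ without ever forming A. The names S, U, D, V follow the slide; the sizes are made up for illustration.

  import numpy as np
  from scipy.sparse import random as sprandom

  n, k = 2000, 5
  rng = np.random.default_rng(0)
  S = sprandom(n, n, density=1e-3, format="csr", random_state=0)  # sparse part
  U = rng.standard_normal((n, k))
  D = np.diag(rng.standard_normal(k))                             # small & square
  V = rng.standard_normal((n, k))

  x = rng.standard_normal(n)
  y = S @ x + U @ (D @ (V.T @ x))   # O(nnz(S) + n*k), never forms A = S + U D V^T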

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads. Absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. For the orthogonal vectors the results have the same magnitude but opposite signs: even the sign is not reproducible.]

Goals/Approaches for Reproducibility
• Consider summation or dot product. Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get the exact answer (fails 2)
  – Prerounding technique (Nguyen, D.); see the sketch below
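A simplified sketch of the prerounding idea (my own rendering, not the ReproBLAS code): round every summand to a fixed power-of-two grid so the grid-aligned parts add with no rounding error, hence order-independently; exact remainders cascade to finer grids.

  import math

  def reproducible_sum(x, levels=3):
      # K-fold prerounding. Each level picks a power-of-two M >= 2*n*bound,
      # rounds every summand to a multiple of ulp(M) via (M + v) - M, and
      # accumulates those parts: every partial sum is then an exactly
      # representable multiple of ulp(M), so the level sum is the same for
      # ANY summation order or partitioning across processors.
      n = len(x)
      m = max((abs(v) for v in x), default=0.0)
      if m == 0.0:
          return 0.0
      total, r, bound = 0.0, list(x), m
      for _ in range(levels):
          M = 2.0 ** (math.ceil(math.log2(n * bound)) + 1)
          q = [(M + v) - M for v in r]          # v rounded to the grid ulp(M)
          r = [v - qv for v, qv in zip(r, q)]   # exact remainders
          total += sum(q)                       # every addition here is exact
          bound = M * 2.0 ** -53                # new bound: |remainder| <= ulp(M)/2
      return total                              # reproducible; error ~ n*ulp(M_last)

  vals = [1e-8, 1.0, -1.0, 3.14159] * 1000
  s1 = reproducible_sum(vals)
  import random
  random.shuffle(vals)
  assert reproducible_sum(vals) == s1           # bit-identical after reordering

The user chooses accuracy through `levels`: more levels keep finer remainders, at proportionally more passes over the data.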

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

• Columnwise layout only:

  func factor(A)
    if A has 1 column
      update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)

  #Words = O(n³/M^(1/2)), #Messages = O(n³/M)

• Shape Morphing LU:

  func factor(A)
    if A has 1 column
      update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)

  #Words = O(n³/M^(1/2)), #Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• The need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of AP = QR span the column space of A: Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat. Hard to do using BLAS3 at all, let alone hit the lower bound.
  – Use tournament pivoting: each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat); see the sketch below
  – Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting
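A serial numpy/scipy stand-in for the tournament (tournament_pivot_columns is an illustrative name): the parallel reduction tree becomes a loop, and each 2b-column contest is decided by ordinary QR with column pivoting.

  import numpy as np
  from scipy.linalg import qr

  def tournament_pivot_columns(A, b):
      # Reduction tree over column blocks: each round merges two groups of b
      # candidate columns and keeps the b "best" of the 2b candidates.
      n = A.shape[1]
      groups = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
      while len(groups) > 1:
          nxt = []
          for i in range(0, len(groups), 2):
              cand = groups[i] + (groups[i + 1] if i + 1 < len(groups) else [])
              _, _, piv = qr(A[:, cand], mode='economic', pivoting=True)
              nxt.append([cand[j] for j in piv[:b]])
          groups = nxt
      return groups[0]   # indices of the b columns to pivot to the front

  # usage: the winners move to the front, then one b-column QR panel step follows
  A = np.random.default_rng(2).standard_normal((200, 64))
  print(tournament_pivot_columns(A, b=8))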

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths (APSP) using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k (A(i,k) + B(k,j))) by D = A*B
  – Dependencies OK, 2.5D works, just a different (min,+) semiring
• Kleene's algorithm (see the Python sketch below):

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 * D12
    D21 = D21 * D11
    D22 = D21 * D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 * D21
    D12 = D12 * D22
    D11 = D12 * D21
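A numpy sketch of the same recursion on the (min,+) semiring (tropical and dc_apsp are illustrative names; the 2.5D version distributes the semiring matmuls across processors):

  import numpy as np

  def tropical(D, A, B):
      # D = A (*) B on the (min,+) semiring:
      # D[i,j] = min(D[i,j], min_k A[i,k] + B[k,j])
      return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

  def dc_apsp(D):
      # Kleene's divide-and-conquer APSP, transcribing the pseudocode above.
      n = D.shape[0]
      if n == 1:
          return D
      h = n // 2
      D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
      D11 = dc_apsp(D11)
      D12 = tropical(D12, D11, D12)
      D21 = tropical(D21, D21, D11)
      D22 = tropical(D22, D21, D12)
      D22 = dc_apsp(D22)
      D21 = tropical(D21, D22, D21)
      D12 = tropical(D12, D12, D22)
      D11 = tropical(D11, D12, D21)
      return np.block([[D11, D12], [D21, D22]])

  # check against the Floyd-Warshall triple loop
  n = 8
  A = np.random.default_rng(3).uniform(1, 10, (n, n))
  np.fill_diagonal(A, 0.0)
  D = A.copy()
  for k in range(n):
      D = np.minimum(D, D[:, [k]] + D[[k], :])
  assert np.allclose(dc_apsp(A.copy()), D)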

Performance of 2.5D APSP using Kleene
[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup]

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains these bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

      Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))

  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost; a quick numerical check of the "no reuse" fact follows
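A small sketch (synthetic matrices, parameters chosen by me) showing that nnz(C) ≈ d²n, i.e., almost every output entry of C = A·B is produced by a single multiply, so entries of C are essentially never reused:

    import numpy as np
    import scipy.sparse as sp

    # Erdos-Renyi sparse matrices: each entry nonzero independently w.p. d/n
    n, d = 10_000, 8                      # d << n^(1/2), as in the theorem
    rng = np.random.default_rng(0)
    A = sp.random(n, n, density=d/n, format='csr', random_state=rng)
    B = sp.random(n, n, density=d/n, format='csr', random_state=rng)
    C = A @ B
    # each of the n^2 entries of C is nonzero w.p. ~ n*(d/n)^2 = d^2/n,
    # so nnz(C) ~ d^2*n -- roughly one flop per output entry, no reuse
    print(C.nnz, d * d * n)               # the two counts should be close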

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – Columns of Q·U are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q A Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time (see the sketch below)
  – Banded → Tridiagonal: need new(ish) idea
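To make the pipeline concrete, here is a minimal NumPy/SciPy sketch (my own simplification, not the TSQR-based implementation): Dense → Banded via two-sided blocked Householder panels, with a banded eigensolver standing in for the Banded → Tridiagonal → Diagonal steps:

    import numpy as np
    from scipy.linalg import eig_banded

    def reduce_to_band(A, b):
        """Two-sided blocked Householder reduction of symmetric A to a
        symmetric banded matrix of bandwidth b, one b-column panel at a
        time; with b ~ M^(1/2) each panel is BLAS-3 and its QR could use TSQR."""
        A = A.copy()
        n = A.shape[0]
        for j in range(0, n - b - 1, b):
            lo = j + b                       # rows strictly below the band
            Q, _ = np.linalg.qr(A[lo:, j:j+b], mode='complete')
            A[lo:, j:] = Q.T @ A[lo:, j:]    # zero the panel below the band
            A[j:, lo:] = A[j:, lo:] @ Q      # symmetric two-sided update
        return A

    n, b = 200, 14                           # b plays the role of M^(1/2)
    A = np.random.rand(n, n); A = A + A.T
    B = reduce_to_band(A, b)
    # pack the lower band (entries below it are now O(machine eps))
    band = np.array([np.pad(np.diag(B, -i), (0, i)) for i in range(b + 1)])
    w = eig_banded(band, lower=True, eigvals_only=True)
    print(np.allclose(np.sort(w), np.sort(np.linalg.eigvalsh(A))))  # True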

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of animation slides: on a symmetric band matrix of bandwidth b, an orthogonal transform Q1 annihilates a parallelogram of c columns and d diagonals (region 1); the two-sided update Q1^T(·)Q1 creates a (d+c) x (d+c) bulge further down the band (region 2), which subsequent sweeps Q2, Q3, Q4, Q5, … chase off the end of the matrix (regions 3–6). Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

  – Conventional: touch all data 4 times
  – Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer (a simplified sketch follows)
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
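A simplified sketch of the splitting step using the Newton iteration for the matrix sign function (the communication-avoiding version on the slide instead uses an inverse-free, QR-based iteration and a randomized rank-revealing QR; this plain version only illustrates the idea):

    import numpy as np
    from scipy.linalg import qr

    def split_spectrum(A, shift=0.0):
        """Q whose leading k columns span the invariant subspace for
        eigenvalues with Re(lambda) > shift, via the matrix sign function."""
        n = A.shape[0]
        X = A - shift * np.eye(n)
        for _ in range(100):                     # Newton: X <- (X + X^-1)/2
            Xn = 0.5 * (X + np.linalg.inv(X))
            if np.linalg.norm(Xn - X, 1) <= 1e-13 * np.linalg.norm(Xn, 1):
                X = Xn
                break
            X = Xn
        P = 0.5 * (X + np.eye(n))                # spectral projector
        k = int(round(np.trace(P)))              # dimension of the subspace
        Q, _, _ = qr(P, pivoting=True)           # leading k cols span range(P)
        return Q, k

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    Q, k = split_spectrum(A)
    T = Q.T @ A @ Q
    print(k, np.linalg.norm(T[k:, :k]))          # ~0: block upper triangular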

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]; columns in the original table: Words and Messages, each for two-level memory and for a full memory hierarchy.

  – BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.]
  – Cholesky: [G'97] [AP'00] [LAPACK] [BDHS'09]
  – Sym. Indefinite: [BBDDDPSTY'13]
  – LU: [G'97] [T'97] [GDX'11] [BDLST'13]
  – QR: [EG'98] [FW'03] [DGHL'12] [BDLST'13]
  – Rank-Revealing QR: [BDD'11] [DGGX'13]
  – Sym. Eig & SVD: [BDD'11] [BDK'13]
  – Non-Sym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring polylog(P) factors; bounds: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]; columns: Words (BW), Messages (L), and the saving factor attainable with extra memory (2.5D, M = Θ(c·n²/P)).

  – BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving: L: n/P^(1/2)
  – Cholesky: [ScaLAPACK] [T'99] [SD'11]; saving: L: n/P^(1/2)
  – Sym. Indefinite: [BBDDDPSTY'13] [ScaLAPACK]; saving: L: n/P^(1/2)
  – LU: [ScaLAPACK] [GDX'11] [T'99] [SD'11]; saving: L: n/P^(1/2)
  – QR: [ScaLAPACK] [DGHL'12] [T'99]; saving: L: n/P^(1/2)
  – Rank-Revealing QR: [BDD'11] [DGGX'13]
  – Sym. Eig & SVD: [BDD'11] [BDK'13] [ScaLAPACK]; saving: L: n/P^(1/2)
  – Non-Sym. Eig: [BDD'11]; saving: BW: P^(1/2), L: n

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
  – A toy 1D example of the key kernel follows
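To see where the O(1)/O(log p) counts come from, here is a toy matrix powers kernel for the 1D Poisson stencil (my own toy setup, not the paper's implementation): one ghost exchange of k values per neighbor replaces the k exchanges a conventional code would do, at the price of redundant flops on the ghost region:

    import numpy as np

    def matrix_powers_1d(x_owned, left_ghosts, right_ghosts):
        """Owned parts of [x, Ax, ..., A^k x] for the 1D Poisson stencil
        (Ax)_i = 2x_i - x_{i-1} - x_{i+1}, from a single ghost exchange.

        left_ghosts/right_ghosts: k neighbor values each, fetched in one
        message per neighbor (a conventional code sends k messages each)."""
        k, m = len(left_ghosts), len(x_owned)
        v = np.concatenate([left_ghosts, x_owned, right_ghosts])
        levels = [v]
        for _ in range(k):
            v = 2 * v[1:-1] - v[:-2] - v[2:]   # one "SpMV"; the valid
            levels.append(v)                   # region shrinks by 1 per side
        # level j still carries k-j ghost entries per side; trim to owned part
        return [u[(len(u) - m) // 2 : (len(u) + m) // 2] for u in levels]

Processors at the domain boundary would fill their ghost arrays from the boundary conditions; that bookkeeping is omitted here.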

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Register-blocking profile: Mflop/s for each r x c block size; the reference (unblocked) code runs far below the best, which is the unintuitive 4x2 blocking.]

Register Profile: Itanium 2

[Heat map of Mflop/s over block sizes, from 190 Mflops (worst) to 1190 Mflops (best).]

Register Profiles: IBM and Intel IA-64

[Four heat maps of SpMV Mflop/s over register block sizes:
  – Power3 (17%): 122 to 252 Mflops
  – Power4 (16%): 459 to 820 Mflops
  – Itanium 1 (8%): 107 to 247 Mflops
  – Itanium 2 (33%): 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher
• A small blocked-format demo follows
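SciPy's BSR format is a convenient way to see the register-blocking tradeoff (a synthetic matrix here, chosen by me, so the fill is much worse than for raefsky-like matrices with natural dense substructure):

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(1)
    A = sp.random(3000, 3000, density=1e-3, format='csr', random_state=rng)
    B = A.tobsr(blocksize=(3, 3))       # BCSR: 3x3 register blocks,
                                        # explicit zeros fill each block
    fill_ratio = B.data.size / A.nnz    # stored values / true nonzeros
    x = rng.standard_normal(3000)
    print(fill_ratio, np.allclose(A @ x, B @ x))

For this random pattern the fill ratio is large and blocking would not pay; for matrices with natural dense substructure it stays near 1, and even a ratio like 1.5 can win because the blocked inner loop is a fixed, unrollable dense 3x3 multiply.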

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Spy plot of the matrix.]

100x100 Submatrix Along Diagonal
[Zoomed spy plot.]

Post-RCM Reordering
[Spy plot after reverse Cuthill-McKee reordering.]

Effect of Combined RCM+TSP Reordering
[Spy plots; before: green + red, after: green + blue.]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm listing not preserved in extraction.] SpMVs and dot products require communication in each iteration; a reference version appears below.
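For reference, a plain sequential CG (the standard algorithm; variable names mine), written so that the two communication points per iteration are visible:

    import numpy as np

    def cg(A, b, tol=1e-10, maxiter=1000):
        """Textbook CG for symmetric positive definite A (e.g., a
        scipy.sparse CSR matrix); per iteration: one SpMV (neighbor
        communication) and two dot products (global reductions)."""
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rs = r @ r                       # dot product -> global reduction
        for _ in range(maxiter):
            Ap = A @ p                   # SpMV -> neighbor communication
            alpha = rs / (p @ Ap)        # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x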

Example: CA-Conjugate Gradient

[Algorithm listing not preserved in extraction.] The k SpMVs are done via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot: CA-CG (monomial basis) vs CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff, with the residual stagnating above machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down. A script reproducing the breakdown follows.]
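The breakdown is easy to reproduce on the same model problem (a small script of mine; per the slide, the basis is numerically rank deficient by s ≈ 16):

    import numpy as np
    import scipy.sparse as sp

    # 2D Poisson, 5-point stencil, on a 30x30 grid: cond(A) ~ 400
    n = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()

    rng = np.random.default_rng(1)
    v = rng.standard_normal(n * n)
    V = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))   # normalized monomial basis vectors
        print(s, np.linalg.cond(np.column_stack(V)))
    # the condition number grows geometrically toward 1/eps: the monomial
    # basis [p, Ap, ..., A^s p] collapses toward the dominant eigenvector;
    # Newton or Chebyshev basis polynomials are the standard fix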

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):    CSR and variations           vision, climate, AMR, …
  Entries implicit (o(nnz)):    graph Laplacian              stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square (see the sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
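The sum structure already lets one apply A, and hence build A^k·x step by step, without ever forming a dense matrix; a sketch with made-up dimensions (the CA formulation would go further and expand (S + U·D·V^T)^k symbolically into S^j terms plus low-rank corrections):

    import numpy as np
    import scipy.sparse as sp

    def apply_power(S, U, D, V, k, x):
        """y = (S + U D V^T)^k x: one SpMV plus thin matrix-vector
        products per step; the dense n x n matrix is never formed."""
        for _ in range(k):
            x = S @ x + U @ (D @ (V.T @ x))
        return x

    n, r = 2000, 5
    rng = np.random.default_rng(2)
    S = sp.random(n, n, density=5/n, format='csr', random_state=rng)
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))   # r x r, small & square
    y = apply_power(S, U, D, V, 3, rng.standard_normal(n))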

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots of dot-product variability when the same inputs are run with 1, 2, 3, or 4 MKL threads; vector size 1e6, data aligned to 16-byte boundaries; absolute error = maximum – minimum over runs, relative error = absolute error / maximum absolute value.
  – Absolute error for random vectors: results of the same magnitude but opposite signs occur.
  – Relative error for orthogonal vectors: even the sign is not reproducible.]

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches (a toy version of the last one follows):
  – Guarantee a fixed reduction tree (violates goal 2 or 3)
  – Use (very) high precision to get the exact answer (violates goal 2)
  – Prerounding technique (Nguyen, D.)
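A toy, single-bin version of the prerounding idea (my simplification of the Nguyen/Demmel scheme; the real algorithm uses several "bins" for accuracy and folds the max into the same reduction): round every summand to a common grid so that each addition is exact, hence order-independent:

    import math

    def reproducible_sum(x):
        """Order-independent sum via prerounding (toy version).

        Rounds each summand to a multiple of a common grid spacing chosen
        so that all partial sums are exact in double precision; the only
        error is the initial rounding, identical for every summation order."""
        n = len(x)
        m = max(abs(v) for v in x)
        if m == 0.0:
            return 0.0
        e = math.frexp(m)[1]                   # m < 2**e
        # grid chosen so each rounded term has <= 53 - ceil(log2 n)
        # significant bits relative to the grid: n exact additions fit
        grid = math.ldexp(1.0, e - 53 + max(1, math.ceil(math.log2(n))))
        s = 0.0
        for v in x:                            # any order gives the same s
            s += round(v / grid) * grid        # each term: exact grid multiple
        return s

    data = [0.1 * i for i in range(1000)]
    assert reproducible_sum(data) == reproducible_sum(list(reversed(data)))

Note the extra pass (or reduction) to find the max, and that the accuracy depends on n and on the spread of the data, which is why the production scheme uses a few bins instead of one.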

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M.

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                              Summary

Don't Communic…

                                                                                              106

Time to redesign all linear algebra, n-body, … algorithms and software

                                                                                              (and compilers)


Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting (sketched below)
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50
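A minimal dense sketch of the tournament, assuming the "usual approach" (ordinary column-pivoted QR) as the local selector; Gu/Eisenstat's strong RRQR could be substituted, and the function name is illustrative:

import numpy as np
from scipy.linalg import qr

def tournament_select_columns(A, b):
    # Repeatedly merge pairs of b-column groups, keeping the b "best"
    # columns of each pair as chosen by column-pivoted QR.
    n = A.shape[1]
    groups = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
    while len(groups) > 1:
        merged = []
        for i in range(0, len(groups), 2):
            cols = groups[i] + (groups[i + 1] if i + 1 < len(groups) else [])
            _, _, p = qr(A[:, cols], mode='economic', pivoting=True)
            merged.append([cols[j] for j in p[:b]])   # winners of this round
        groups = merged
    return groups[0]    # indices of b candidate leading columns

idx = tournament_select_columns(np.random.rand(100, 64), b=8)

Each round factors disjoint column groups independently, so in the CA setting the tournament needs only about log2(n/b) reduction steps.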

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k (A(i,k) + B(k,j))) by D = A*B
  – Dependencies OK, 2.5D works, just different semiring
• Kleene's Algorithm:

                                                                                                52

for k = 1:n
  for i = 1:n
    for j = 1:n
      D(i,j) = min(D(i,j), D(i,k) + D(k,j))

D = DC-APSP(A, n):
  D = A
  Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
  D11 = DC-APSP(D11, n/2)
  D12 = D11 * D12
  D21 = D21 * D11
  D22 = D21 * D12
  D22 = DC-APSP(D22, n/2)
  D21 = D22 * D21
  D12 = D12 * D22
  D11 = D12 * D21
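A runnable sketch of DC-APSP over the (min, +) semiring (NumPy; the O(n³)-memory minplus is written for clarity, not performance, and assumes no negative cycles):

import numpy as np

def minplus(A, B):
    # (A*B)(i,j) = min_k A(i,k) + B(k,j): matmul over the (min,+) semiring,
    # accumulated into the old value via np.minimum at the call sites
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return np.minimum(D, 0.0)   # zero-length path to self
    m = n // 2
    D11, D12 = D[:m, :m], D[:m, m:]
    D21, D22 = D[m:, :m], D[m:, m:]
    D11 = dc_apsp(D11)
    D12 = np.minimum(D12, minplus(D11, D12))
    D21 = np.minimum(D21, minplus(D21, D11))
    D22 = np.minimum(D22, minplus(D21, D12))
    D22 = dc_apsp(D22)
    D21 = np.minimum(D21, minplus(D22, D21))
    D12 = np.minimum(D12, minplus(D12, D22))
    D11 = np.minimum(D11, minplus(D12, D21))
    return np.block([[D11, D12], [D21, D22]])

# quick check against the Floyd-Warshall triple loop above
W = np.random.default_rng(0).uniform(1, 10, (8, 8))
np.fill_diagonal(W, 0.0)
F = W.copy()
for k in range(8):
    F = np.minimum(F, F[:, [k]] + F[[k], :])
assert np.allclose(dc_apsp(W), F)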

Performance of 2.5D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54
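A worked instance of the bound, for the 2D model problem: an n x n grid has N = n² unknowns and a top-level separator of w = n vertices, so the dense Cholesky on that separator alone already forces

Words_moved = Ω(w³/M^(1/2)) = Ω(n³/M^(1/2)) = Ω(N^(3/2)/M^(1/2))

i.e., the total flop count Θ(N^(3/2)) of nested-dissection Cholesky divided by M^(1/2); attaining the dense bound on the separators therefore attains it for the whole factorization.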

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω(min( dn/P^(1/2), d²n/P ))
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55
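To quantify the contrast (a worked comparison, not from the slide): whenever d ≤ P^(1/2), the min is attained by its second term, and

Ω(d²n/P) = M^(1/2) · Ω(d²n/(P·M^(1/2)))

so in this regime the sparsity-independent bound is stronger than the general memory-dependent bound by a factor of M^(1/2).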

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time (sketched below)
  – Banded → Tridiagonal: need new(ish) idea
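A dense NumPy/SciPy sketch of the Dense → Banded stage via two-sided panel QR. In the CA algorithm each panel factorization would be a TSQR with blocked updates; here plain LAPACK QR stands in, and the function name is illustrative:

import numpy as np
from scipy.linalg import qr

def sym_to_band(A, b):
    # Reduce dense symmetric A to a symmetric band of bandwidth b by
    # zeroing b columns at a time with an orthogonal similarity.
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n - b, b):
        rows = slice(j + b, n)
        Q, _ = qr(A[rows, j:j + b])      # orthogonal Q zeroing the panel
        A[rows, :] = Q.T @ A[rows, :]    # apply Q^T from the left...
        A[:, rows] = A[:, rows] @ Q      # ...and Q from the right
    return A

rng = np.random.default_rng(0)
A0 = rng.standard_normal((96, 96)); A0 = A0 + A0.T
B = sym_to_band(A0, b=8)
assert np.allclose(np.linalg.eigvalsh(B), np.linalg.eigvalsh(A0))  # same spectrum
assert np.allclose(B, np.tril(np.triu(B, -8), 8))                  # bandwidth 8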

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth
c = #columns
d = #diagonals
Constraint: c + d ≤ b

[Sequence of figures animating one sweep over a symmetric band of width b+1: an orthogonal transform Q1 annihilates a d x c parallelogram of the band (region 1), creating a (d+c) x (d+c) bulge (region 2); applying Q1ᵀ on the other side keeps the matrix symmetric; transforms Q2, Q3, Q4, Q5, … then chase the successive bulges (regions 3, 4, 5, 6) down and off the band]

Conventional vs CA - SBR

Conventional: touch all data 4 times
Communication-Avoiding: touch all data once

[Embedded animations comparing the two bulge-chasing schedules]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer (sketched below)
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
      QᵀAQ = [ A11  A12 ]
             [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
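A small sketch of one splitting step, under loud assumptions: A real with no eigenvalues near the splitting line Re(λ) = σ; the matrix sign function (Newton iteration) stands in for the paper's implicit repeated squaring, and plain pivoted QR stands in for randomized RRQR. Names are illustrative:

import numpy as np
from scipy.linalg import qr, inv

def split_spectrum(A, sigma=0.0, iters=50):
    n = A.shape[0]
    X = A - sigma * np.eye(n)
    for _ in range(iters):              # Newton iteration for sign(X);
        X = 0.5 * (X + inv(X))          # assumes iterates stay nonsingular
    P = 0.5 * (np.eye(n) + X)           # projector onto Re(lambda) > sigma
    Q, _, _ = qr(P, pivoting=True)      # leading columns span range(P)
    B = Q.T @ A @ Q                     # block upper triangular (up to eps)
    r = int(round(np.trace(P)))         # dimension of the invariant subspace
    return Q, B, r

A = np.random.default_rng(1).standard_normal((50, 50))
Q, B, r = split_spectrum(A)             # assumes a decent gap at Re = 0
print(r, np.linalg.norm(B[r:, :r]))     # the (2,1) block should be tiny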

Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(citations attaining the Words and Messages bounds, for two levels of memory and for a full memory hierarchy)

• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: [G'97], [T'97], [GDX'11], [BDLST'13]
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
• Rank Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13]
• Non-Sym. Eig: [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

• BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK], [T'99], [SD'11]; L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13], [ScaLAPACK]; L: n/P^(1/2)
• LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; L: n/P^(1/2)
• QR: [ScaLAPACK], [DGHL'12], [T'99]; L: n/P^(1/2)
• Rank Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK]; L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11]; BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the matrix-powers sketch below)
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
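A minimal sketch of the ingredient behind those O(1)/O(log p) counts, the matrix-powers kernel, shown for the 1D Poisson stencil A = tridiag(-1, 2, -1) with Dirichlet boundaries (the function and its interface are illustrative):

import numpy as np

def local_matrix_powers(x, k, lo, hi):
    # The processor owning x[lo:hi] fetches a depth-k halo once
    # (O(1) messages) and then computes its rows of [x, Ax, ..., A^k x]
    # with no further communication, at the price of redundant flops.
    n = len(x)
    gl, gr = max(lo - k, 0), min(hi + k, n)   # one up-front halo exchange
    v = x[gl:gr].astype(float)
    out = [v[lo - gl:hi - gl].copy()]
    for j in range(1, k + 1):
        w = 2.0 * v
        w[:-1] -= v[1:]
        w[1:] -= v[:-1]
        # cells within j of a halo edge are stale, but staleness advances
        # one cell per step, so it never reaches [lo, hi) while j <= k
        v = w
        out.append(v[lo - gl:hi - gl].copy())
    return np.stack(out)      # row j is the owned slice of A^j x

print(local_matrix_powers(np.arange(10.0), k=2, lo=3, hi=6))

The conventional approach exchanges a depth-1 halo before each of the k SpMVs (O(k) messages); fetching the depth-k halo once is what collapses that to O(1).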

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the sketch below)

78
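A sketch of what the 8x8 substructure buys, using SciPy's Block CSR as a stand-in for OSKI-style register blocking; the model matrix only imitates raefsky's size and density, it is not the actual matrix:

import numpy as np
import scipy.sparse as sp

# A matrix whose nonzeros come in dense 8x8 blocks: BSR stores one
# column index per block instead of one per entry, cutting index
# memory references in SpMV by ~64x relative to plain CSR.
n, b = 21200, 8
blocks = sp.random(n // b, n // b, density=0.0033, format='csr',
                   random_state=0)
A_csr = sp.kron(blocks, np.ones((b, b)), format='csr')   # ~1.5M nonzeros
A_bsr = A_csr.tobsr(blocksize=(b, b))
x = np.ones(n)
assert np.allclose(A_csr @ x, A_bsr @ x)                 # same SpMV result
print(A_csr.indices.size, A_bsr.indices.size)            # ~64x fewer indices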

Speedups on Itanium 2: The Need for Search

[Figure: Mflops of the reference implementation vs the best register blocking (4x2), across all block sizes]

79

Register Profile: Itanium 2
[Figure: register-blocking performance profile, from 190 Mflops (worst) to 1190 Mflops (best)]

80

Register Profiles: IBM and Intel IA-64
[Figure, four panels of register-blocking profiles: Power3 – 17, Power4 – 16, Itanium 2 – 33, Itanium 1 – 8; achieved rates range from 107 Mflops to 1.2 Gflops depending on machine and block size]

81

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of “fill-in”

84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher

                                                                                                85
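A minimal sketch of this fill-in tradeoff using SciPy's BSR format (an assumption: the slide's experiment used a hand-tuned kernel, not SciPy):

import scipy.sparse as sp

A = sp.random(3000, 3000, density=1e-3, format="csr", random_state=0)
B = A.tobsr(blocksize=(3, 3))    # pads each 3x3 block with explicit zeros

fill_ratio = B.nnz / A.nnz       # stored entries after blocking / true nonzeros
print(f"fill ratio = {fill_ratio:.2f}")

Register blocking wins when the blocked kernel's per-entry speed beats the fill ratio: the slide's 1.5x overall speedup at fill ratio 1.5 means the unrolled 3x3 kernel ran 1.5² = 2.25x faster per stored entry.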

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                86

                                                                                                100x100 Submatrix Along Diagonal

87

                                                                                                Post-RCM Reordering

                                                                                                88

                                                                                                Effect of Combined RCM+TSP Reordering

Before: Green + Red
After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user’s matrix & machine
  – BLAS-style functionality: SpMV, Ax & Aᵀy, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For “advanced” users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                91
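A toy version of the search OSKI automates (a hypothetical harness, not OSKI's actual C API): time SpMV under candidate register blockings on the user's matrix and keep the fastest.

import time
import numpy as np
import scipy.sparse as sp

def pick_blocksize(A, candidates=((1, 1), (2, 2), (3, 3), (4, 2), (8, 8)), trials=10):
    x = np.ones(A.shape[1])
    best, best_t = (1, 1), float("inf")
    for r, c in candidates:
        if A.shape[0] % r or A.shape[1] % c:
            continue                      # block size must tile the matrix
        B = A.tobsr(blocksize=(r, c))     # register-blocked copy
        t0 = time.perf_counter()
        for _ in range(trials):
            B @ x                         # time the candidate kernel
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = (r, c), t
    return best

A = sp.random(3000, 3000, density=1e-3, format="csr", random_state=0)
print(pick_blocksize(A))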

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                93

Example: Classical Conjugate Gradient (CG)
• SpMVs and dot products require communication in each iteration

94

Example: CA-Conjugate Gradient
• Krylov basis vectors computed via the CA matrix powers kernel
• One global reduction computes the Gram matrix G
• Local computations within inner loop require no communication
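A minimal sketch (assuming the monomial basis and no preconditioner) of the two communication-avoiding building blocks in CA-CG — one matrix powers call and one Gram-matrix reduction per block of s steps:

import numpy as np

def powers(A, v, k):
    # [v, A v, ..., A^k v]; plain SpMVs stand in for the CA matrix powers kernel.
    V = [v]
    for _ in range(k):
        V.append(A @ V[-1])
    return V

def ca_cg_blocks(A, r, p, s):
    # Basis for the 2s+1 vectors that a block of s CG steps needs.
    V = np.column_stack(powers(A, p, s) + powers(A, r, s - 1))
    # One global reduction: G holds every inner product those s steps need.
    G = V.T @ V
    return V, G

The s inner CG steps then update short coefficient vectors of x, r, p in the basis V, reading only G — no SpMV and no dot-product reductions inside the loop.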

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                96

[Plot: CG vs CA-CG (monomial basis) on a model problem – 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400 – with the machine-precision level marked. Roundoff slows convergence and costs accuracy; at s = 16 the monomial basis is rank deficient and the method breaks down.]

                                                                                                97
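A small experiment (mirroring the slide's model problem, an assumption on my part) showing why the monomial basis breaks down: the condition number of [x, Ax, …, A^s x] grows rapidly with s and should hit numerical rank deficiency (~1e16 in double precision) around s = 16 here.

import numpy as np
import scipy.sparse as sp

n = 30                                     # 30x30 grid, 2D Poisson, 5-point stencil
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))

rng = np.random.default_rng(0)
x = rng.standard_normal(n * n)
V = [x]
for s in range(1, 17):
    V.append(A @ V[-1])
    print(s, np.linalg.cond(np.column_stack(V)))   # 2-norm condition via SVD

Better-conditioned Newton or Chebyshev bases are the standard fix in CA-Krylov methods.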

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of “sparse” matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices

                       Nonzero entries:
                       Explicit (O(nnz))    | Implicit (o(nnz))
Indices:
  Explicit (O(nnz))    CSR and variations   | Vision, climate, AMR, …
  Implicit (o(nnz))    Graph Laplacian      | Stencils
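A minimal sketch of exploiting the A = S + UDVᵀ representation: apply Aᵏ to a vector without ever forming the dense matrix, touching only the sparse part and the skinny factors (the names here are illustrative, not an API).

import numpy as np

def apply_power(S, U, D, V, x, k):
    # Compute (S + U @ D @ V.T)^k @ x; O(nnz(S) + n*r) work per step, r = rank.
    y = x
    for _ in range(k):
        y = S @ y + U @ (D @ (V.T @ y))
    return y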

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don’t believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: “What?! How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors]
• Vector size: 1e6; data aligned to 16-byte boundaries
• For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value
• Sign not reproducible

                                                                                                103
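A small pure-NumPy demonstration (an assumption: the slide's data used Intel MKL) that summation order alone changes the result — which is why thread count can flip low-order bits, or even the sign when the true value is near zero.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10**5)
x = np.concatenate([x, -x])        # exact sum is 0, like the slide's vectors
rng.shuffle(x)

s1 = np.sum(x)                                   # one summation order
s2 = sum(float(v) for v in x)                    # strictly sequential order
s3 = np.sum(x.reshape(1000, -1).sum(axis=1))     # blocked, like a threaded reduction
print(s1, s2, s3)    # typically three different tiny values, signs may differ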

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

                                                                                                104
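A minimal sketch (assuming IEEE 754 doubles) of the prerounding idea: round every summand to a common grid chosen from the global maximum, so each rounded part is a small multiple of one unit in the last place of the shift and their sum is exact — hence identical for every summation order, layout, or processor count.

import numpy as np

def reproducible_sum(x):
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Grid spacing coarse enough that the partial sums stay exact; the
    # 2^40 slack here is an arbitrary choice for this sketch.
    shift = 2.0 ** (np.ceil(np.log2(m)) + 40)
    quantized = (x + shift) - shift     # rounds each x_i onto the grid
    return float(np.sum(quantized))     # exact, order-independent sum

Accuracy is limited by the grid and is user-tunable; the real algorithm uses several bins to recover more digits while staying reproducible.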

Performance results on 1024 proc. Cray XC30
1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don’t Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can’t reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene’s Algorithm:

  D = DC-APSP(A, n):
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21

52
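A runnable sketch of the recursion above over the (min, +) semiring; the naive broadcast in minplus uses O(n³) memory, so it is for demo sizes only.

import numpy as np

def minplus(D, A, B):
    # D = min(D, A (x) B): tropical matmul accumulated into D, per the
    # abbreviation on this slide.
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(A):
    n = A.shape[0]
    if n == 1:
        return np.minimum(A, 0.0)        # zero-length path to self
    D = A.copy(); h = n // 2             # split point
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

For nonnegative weights (np.inf for missing edges), dc_apsp(W) should agree with scipy.sparse.csgraph.floyd_warshall(W).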

Performance of 2.5D APSP using Kleene

53

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan ’79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George ’73): nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU’s columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK’s sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea
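A minimal sketch (assuming symmetric A and b dividing n; ordinary QR stands in for TSQR) of the Dense → Banded step — QR-factor the panel below the band, apply Q from both sides — followed by a banded eigensolver for the remaining stages.

import numpy as np
from scipy.linalg import qr, eig_banded

def dense_to_banded(A, b):
    # Orthogonal similarity reduction of symmetric A to bandwidth b.
    A = A.copy(); n = A.shape[0]
    for j in range(0, n - 2 * b + 1, b):
        panel = A[j + b:, j:j + b]          # block column below the band
        Q, R = qr(panel)                    # CA version uses TSQR here
        A[j + b:, j:j + b] = R              # exact zeros below b-th subdiagonal
        A[j:j + b, j + b:] = R.T            # mirror block row, keeping symmetry
        A[j + b:, j + b:] = Q.T @ A[j + b:, j + b:] @ Q   # two-sided update
    return A

n, b = 64, 8                                # b plays the role of M^(1/2)
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
B = dense_to_banded(A, b)
bands = np.array([np.pad(np.diag(B, -i), (0, i)) for i in range(b + 1)])
w = eig_banded(bands, lower=True, eigvals_only=True)
print(np.allclose(w, np.linalg.eigvalsh(A)))    # True: same spectrum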

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures, one per sweep 1–5: each sweep Qi · A · Qiᵀ eliminates a parallelogram of c columns from the band and chases the resulting bulge of d+c diagonals down the matrix. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

                                                                                                  d+c

                                                                                                  d+c

                                                                                                  d+c

                                                                                                  d+c

                                                                                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                  Successive Band Reduction (BischofLangSun)

                                                                                                  1

                                                                                                  1

                                                                                                  2

                                                                                                  2

                                                                                                  3

                                                                                                  3

                                                                                                  4

                                                                                                  4

                                                                                                  5

                                                                                                  5

                                                                                                  6

                                                                                                  6

                                                                                                  Q5T

                                                                                                  Q1

                                                                                                  Q1T

                                                                                                  Q2

                                                                                                  Q2T

                                                                                                  Q3

                                                                                                  Q3T

                                                                                                  Q5

                                                                                                  Q4

                                                                                                  Q4T

                                                                                                  b+1

                                                                                                  b+1

                                                                                                  d+1

                                                                                                  d+1

                                                                                                  c

                                                                                                  c

                                                                                                  d+c

                                                                                                  d+c

                                                                                                  d+c

                                                                                                  d+c

                                                                                                  b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                  Successive Band Reduction (BischofLangSun)

                                                                                                  Conventional vs CA-SBR

                                                                                                  Conventional: touch all data 4 times
                                                                                                  Communication-Avoiding: touch all data once

                                                                                                  Speedups of Sym. Band Reduction vs DSBTRD

                                                                                                  • Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
                                                                                                  • Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
                                                                                                  • Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
                                                                                                  • Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
                                                                                                  • Neither MKL nor ACML benefits from multithreading in DSBTRD
                                                                                                    – Best sequential speedup vs MKL: 1.9x
                                                                                                    – Best sequential speedup vs ACML: 8.5x

                                                                                                  Nonsymmetric Eigenproblem

                                                                                                  bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

                                                                                                  ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

                                                                                                  ndash QTAQ will be block upper triangularndash Apply recursively to A11 A22

                                                                                                  ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

                                                                                                  A11 A12

                                                                                                  ε A22
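To make one splitting step concrete, here is a minimal dense sketch. It splits the spectrum along the imaginary axis via the matrix sign function (Newton iteration), with deterministic column-pivoted QR standing in for the randomized RRQR; the actual communication-avoiding algorithm [BDD'11] uses randomization and implicit repeated squaring, so treat split_spectrum as an illustrative, hypothetical helper only:

    import numpy as np
    from scipy.linalg import qr

    def split_spectrum(A, tol=1e-10, maxit=50):
        # Newton iteration S <- (S + S^-1)/2 converges to sign(A)
        # when A has no eigenvalues on the imaginary axis.
        S = A.copy()
        for _ in range(maxit):
            S_next = 0.5 * (S + np.linalg.inv(S))
            if np.linalg.norm(S_next - S, 1) <= tol * np.linalg.norm(S, 1):
                S = S_next
                break
            S = S_next
        # (S + I)/2 projects onto the invariant subspace for Re(lambda) > 0.
        P = 0.5 * (S + np.eye(A.shape[0]))
        # Column-pivoted QR of the projector (RRQR stand-in): the leading
        # k columns of Q span the invariant subspace.
        Q, R, _ = qr(P, pivoting=True)
        k = int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0])))
        T = Q.T @ A @ Q   # block upper triangular: T[k:, :k] is O(eps)
        return Q, T, k    # recurse on T[:k, :k] and T[k:, k:]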

                                                                                                  Attaining the Lower bounds: Sequential
                                                                                                  Legend: [Existing] [Ours] [Math-Lib] [Random]
                                                                                                  (— marks cells not recoverable from the extraction)

                                                                                                  Algorithm          | Two Levels: #Words / #Messages                                 | Memory Hierarchy: #Words / #Messages
                                                                                                  BLAS-3             | [FLPR'99][BDLST'13][MKL etc.] / same                           | [FLPR'99][BDLST'13][MKL etc.] / same
                                                                                                  Cholesky           | [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09]        | [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
                                                                                                  Sym. Indefinite    | [BBDDDPSTY'13] / [BBDDDPSTY'13]                                | —
                                                                                                  LU                 | [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13]            | [G'97][T'97][BDLST'13] / [BDLST'13]
                                                                                                  QR                 | [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
                                                                                                  Rank Revealing QR  | [BDD'11][DGGX'13] / —                                          | —
                                                                                                  Sym. Eig & SVD     | [BDD'11][BDK'13] / [BDD'11]                                    | —
                                                                                                  Non-Sym. Eig       | [BDD'11] / [BDD'11]                                            | —

                                                                                                  Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
                                                                                                  (Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
                                                                                                  Legend: [Existing] [Ours] [Math-Lib] [Random]

                                                                                                  Algorithm          | #Words (BW) / #Messages (L)                                    | Saving factor
                                                                                                  BLAS-3             | [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]                | L: n/P^(1/2)
                                                                                                  Cholesky           | [ScaLAPACK][T'99][SD'11]                                       | L: n/P^(1/2)
                                                                                                  Sym. Indefinite    | [BBDDDPSTY'13][ScaLAPACK] / [BBDDDPSTY'13]                     | L: n/P^(1/2)
                                                                                                  LU                 | [ScaLAPACK][GDX'11][T'99][SD'11] / [GDX'11][T'99][SD'11]       | L: n/P^(1/2)
                                                                                                  QR                 | [ScaLAPACK][DGHL'12][T'99] / [DGHL'12][T'99]                   | L: n/P^(1/2)
                                                                                                  Rank Revealing QR  | [BDD'11][DGGX'13]                                              | —
                                                                                                  Sym. Eig & SVD     | [BDD'11][BDK'13][ScaLAPACK] / [BDD'11][BDK'13]                 | L: n/P^(1/2)
                                                                                                  Non-Sym. Eig       | [BDD'11] / [BDD'11]                                            | BW: P^(1/2), L: n

                                                                                                  Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

                                                                                                  Outline
                                                                                                  • Review, extend communication lower bounds
                                                                                                  • Direct Linear Algebra Algorithms
                                                                                                    – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
                                                                                                    – LU & QR (tournament pivoting)
                                                                                                    – Sparse matrices
                                                                                                    – Eigenproblems (symmetric and nonsymmetric)
                                                                                                  • Iterative Linear Algebra
                                                                                                    – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
                                                                                                    – Reorganizing Krylov methods – Conjugate Gradients
                                                                                                    – Stability challenges and approaches
                                                                                                    – What is a "sparse matrix"?
                                                                                                  • Floating-point reproducibility
                                                                                                    – Despite nondeterminism/nonassociativity

                                                                                                  Avoiding Communication in Iterative Linear Algebra

                                                                                                  • k steps of iterative solver for sparse Ax=b or Ax=λx
                                                                                                    – Does k SpMVs with A and starting vector
                                                                                                    – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
                                                                                                  • Goal: minimize communication
                                                                                                    – Assume matrix "well-partitioned"
                                                                                                    – Serial implementation:
                                                                                                      • Conventional: O(k) moves of data from slow to fast memory
                                                                                                      • New: O(1) moves of data – optimal (see the sketch below)
                                                                                                    – Parallel implementation on p processors:
                                                                                                      • Conventional: O(k log p) messages (k SpMV calls, dot products)
                                                                                                      • New: O(log p) messages – optimal
                                                                                                  • Lots of speedup possible (modeled and measured)
                                                                                                    – Price: some redundant computation
                                                                                                    – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                  75
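As a concrete, simplified illustration of how O(1) data movement is possible, here is a minimal sketch of the matrix powers kernel for a 1D 3-point stencil (tridiagonal A): each processor reads its block of x plus k ghost values once, then computes its rows of [Ax, A²x, …, Aᵏx] with no further communication. The function name and partitioning are hypothetical, for illustration only:

    import numpy as np

    def local_matrix_powers(x, lo, hi, k):
        # Owned rows lo:hi of [A x, A^2 x, ..., A^k x] for the 1D 3-point
        # stencil (A v)[i] = v[i-1] - 2*v[i] + v[i+1].
        # Assumes an interior block: k <= lo and hi <= len(x) - k.
        v = x[lo - k : hi + k].copy()   # one read: owned block + k ghost cells per side
        out = []
        for j in range(1, k + 1):       # k local sweeps, no further reads of x
            v = v[:-2] - 2 * v[1:-1] + v[2:]          # valid region shrinks by 1 per side
            out.append(v[k - j : k - j + (hi - lo)])  # owned rows of A^j x
        return out

    # e.g.: x = np.random.randn(1000); parts = local_matrix_powers(x, 200, 300, 8)

The price is the redundant computation on the ghost zones, exactly as the slide notes.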

                                                                                                  Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                  ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                  ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                  bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                  bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                  Example: The Difficulty of Tuning SpMV

                                                                                                  • n = 21200
                                                                                                  • nnz = 1.5 M
                                                                                                  • Source: NASA structural analysis problem (raefsky)

                                                                                                  77

                                                                                                  Example: The Difficulty of Tuning

                                                                                                  • n = 21200
                                                                                                  • nnz = 1.5 M
                                                                                                  • Source: NASA structural analysis problem (raefsky)
                                                                                                  • 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                  78

                                                                                                  Speedups on Itanium 2: The Need for Search

                                                                                                  [Figure: Mflops achieved for every register block size; Reference (unblocked) vs Best (4x2) blocking]

                                                                                                  79

                                                                                                  Register Profile: Itanium 2

                                                                                                  [Figure: register-blocking profile, from 190 Mflops (worst) to 1190 Mflops (best)]

                                                                                                  80

                                                                                                  Register Profiles: IBM and Intel IA-64

                                                                                                  [Figure, four panels: Power3 - 17 (122 to 252 Mflops), Power4 - 16 (459 to 820 Mflops), Itanium 1 - 8 (107 to 247 Mflops), Itanium 2 - 33 (190 Mflops to 1.2 Gflops)]

                                                                                                  Another example of tuning challenges for SpMV

                                                                                                  • Ex11 matrix (fluid flow)
                                                                                                  • More complicated non-zero structure in general
                                                                                                  • N = 16614
                                                                                                  • NNZ = 1.1 M

                                                                                                  82

                                                                                                  Zoom in to top corner

                                                                                                  • More complicated non-zero structure in general
                                                                                                  • N = 16614
                                                                                                  • NNZ = 1.1 M

                                                                                                  83

                                                                                                  3x3 blocks look natural, but…

                                                                                                  • Example: 3x3 blocking
                                                                                                    – Logical grid of 3x3 cells
                                                                                                  • But would lead to lots of "fill-in"

                                                                                                  84

                                                                                                  Extra Work Can Improve Efficiency

                                                                                                  • Example: 3x3 blocking
                                                                                                    – Logical grid of 3x3 cells
                                                                                                    – Fill in explicit zeros
                                                                                                    – Unroll 3x3 block multiplies
                                                                                                    – "Fill ratio" = 1.5
                                                                                                  • On Pentium III: 1.5x speedup! (actual Mflop rate 1.5² = 2.25x higher; see the sketch below)

                                                                                                  85
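A quick way to measure this trade-off with SciPy: converting CSR to BSR pads each nonzero r x c block with explicit zeros, so the ratio of stored entries gives the fill ratio. Blocking wins roughly when the blocked kernel's Mflop-rate speedup exceeds that ratio. A minimal sketch (the fill_ratio helper is our hypothetical name; assumes the matrix dimensions are divisible by r and c):

    import scipy.sparse as sp

    def fill_ratio(A, r, c):
        # BSR storage pads every nonzero r x c block with explicit zeros,
        # so B.nnz counts true nonzeros plus fill.
        A = sp.csr_matrix(A)
        B = sp.bsr_matrix(A, blocksize=(r, c))
        return B.nnz / A.nnz

    # e.g. fill_ratio(A, 3, 3) == 1.5 means 50% extra flops -- worthwhile
    # if the 3x3 kernel runs more than 1.5x faster than the 1x1 kernel.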

                                                                                                  Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                  86

                                                                                                  100x100 Submatrix Along Diagonal

                                                                                                  87

                                                                                                  Post-RCM Reordering

                                                                                                  88

                                                                                                  Effect of Combined RCM+TSP Reordering

                                                                                                  [Figure: Before = Green + Red; After = Green + Blue]
                                                                                                  2x speedups on Pentium 4, Power 4, …

                                                                                                  89

                                                                                                  Summary of Other Performance Optimizations

                                                                                                  • Optimizations for SpMV
                                                                                                    – Register blocking (RB): up to 4x over CSR
                                                                                                    – Reordering to create dense structure: 2x over CSR
                                                                                                    – Variable block splitting: 2.1x over CSR, 1.8x over RB
                                                                                                    – Diagonals: 2x over CSR
                                                                                                    – Symmetry: 2.8x over CSR, 2.6x over RB
                                                                                                    – Cache blocking: 2.8x over CSR
                                                                                                    – Multiple vectors (SpMM): 7x over CSR
                                                                                                    – And combinations…
                                                                                                  • Sparse triangular solve
                                                                                                    – Hybrid sparse/dense data structure: 1.8x over CSR
                                                                                                  • Higher-level kernels
                                                                                                    – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
                                                                                                    – More general kernels later…

                                                                                                  90

                                                                                                  Optimized Sparse Kernel Interface – OSKI

                                                                                                  • Provides sparse kernels automatically tuned for user's matrix & machine
                                                                                                    – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
                                                                                                    – Does both off-line and run-time tuning
                                                                                                    – Hides complexity of run-time tuning
                                                                                                  • For "advanced" users & solver library writers
                                                                                                    – Available as stand-alone library
                                                                                                    – Available as PETSc extension
                                                                                                    – bebop.cs.berkeley.edu/oski
                                                                                                  • pOSKI
                                                                                                    – Extension to multicore architectures
                                                                                                    – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
                                                                                                    – bebop.cs.berkeley.edu/poski

                                                                                                  91

                                                                                                  Outline
                                                                                                  • Review, extend communication lower bounds
                                                                                                  • Direct Linear Algebra Algorithms
                                                                                                    – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
                                                                                                    – LU & QR (tournament pivoting)
                                                                                                    – Sparse matrices
                                                                                                    – Eigenproblems (symmetric and nonsymmetric)
                                                                                                  • Iterative Linear Algebra
                                                                                                    – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
                                                                                                    – Reorganizing Krylov methods – Conjugate Gradients
                                                                                                    – Stability challenges and approaches
                                                                                                    – What is a "sparse matrix"?
                                                                                                  • Floating-point reproducibility
                                                                                                    – Despite nondeterminism/nonassociativity

                                                                                                  93

                                                                                                  Example: Classical Conjugate Gradient (CG)

                                                                                                  [Algorithm: SpMVs and dot products require communication in each iteration]

                                                                                                  94

                                                                                                  Example: CA-Conjugate Gradient

                                                                                                  [Algorithm: the k SpMVs are replaced by one call to the CA matrix powers kernel, and one global reduction computes the Gram matrix G; local computations within the inner loop require no communication. A plain-CG sketch marking the communication points follows below.]
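For reference, a minimal numpy sketch of classical CG with its per-iteration communication points marked; this is textbook CG (not the CA-CG reorganization), and the function name is ours:

    import numpy as np

    def cg(A, b, tol=1e-8, maxiter=1000):
        # Textbook CG for SPD A. Each iteration: 1 SpMV and 2 dot products --
        # each a communication event (slow-memory traffic, or a global
        # reduction in a parallel setting).
        x = np.zeros_like(b)
        r = b.copy()               # r = b - A x0, with x0 = 0
        p = r.copy()
        rr = r @ r                 # dot product -> global reduction
        for _ in range(maxiter):
            Ap = A @ p             # SpMV -> move matrix / halo data
            alpha = rr / (p @ Ap)  # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r         # dot product -> global reduction
            if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

CA-CG restructures these s iterations at a time: one matrix powers kernel call replaces s SpMVs, and one blocked reduction (the Gram matrix G above) replaces 2s dot products.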

                                                                                                  Outline
                                                                                                  • Review, extend communication lower bounds
                                                                                                  • Direct Linear Algebra Algorithms
                                                                                                    – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
                                                                                                    – LU & QR (tournament pivoting)
                                                                                                    – Sparse matrices
                                                                                                    – Eigenproblems (symmetric and nonsymmetric)
                                                                                                  • Iterative Linear Algebra
                                                                                                    – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
                                                                                                    – Reorganizing Krylov methods – Conjugate Gradients
                                                                                                    – Stability challenges and approaches
                                                                                                    – What is a "sparse matrix"?
                                                                                                  • Floating-point reproducibility
                                                                                                    – Despite nondeterminism/nonassociativity

                                                                                                  96

                                                                                                  [Figure: convergence of CG vs CA-CG (monomial basis) on a model problem – 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down, while CG converges to machine precision.]

                                                                                                  97

                                                                                                  Outline
                                                                                                  • Review, extend communication lower bounds
                                                                                                  • Direct Linear Algebra Algorithms
                                                                                                    – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
                                                                                                    – LU & QR (tournament pivoting)
                                                                                                    – Sparse matrices
                                                                                                    – Eigenproblems (symmetric and nonsymmetric)
                                                                                                  • Iterative Linear Algebra
                                                                                                    – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
                                                                                                    – Reorganizing Krylov methods – Conjugate Gradients
                                                                                                    – Stability challenges and approaches
                                                                                                    – What is a "sparse matrix"?
                                                                                                  • Floating-point reproducibility
                                                                                                    – Despite nondeterminism/nonassociativity

                                                                                                  What is a "sparse matrix"?

                                                                                                  • Requires o(n²) data/indices to store
                                                                                                  • Nonzero entries and indices could be explicit or implicit:

                                                                                                                                  Indices explicit (O(nnz))   Indices implicit (o(nnz))
                                                                                                    Entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
                                                                                                    Entries implicit (o(nnz)):  Graph Laplacian             Stencils

                                                                                                  • Matrix could be sum of "sparse" matrices
                                                                                                    – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
                                                                                                  • Semiseparable matrices arise as preconditioners
                                                                                                    – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices (see the sketch below)
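A minimal SciPy sketch of the S + UDVᵀ idea: apply A (and hence its powers, by repeated application) without ever forming the dense n x n matrix. All sizes and names here are made up for illustration:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator

    n, k = 10_000, 5
    S = sp.random(n, n, density=1e-4, format='csr')  # sparse part
    U = np.random.randn(n, k)                        # low-rank part U D V^T
    V = np.random.randn(n, k)
    D = np.diag(np.random.randn(k))

    # x -> S x + U (D (V^T x)): O(nnz(S) + n k) work per application
    A = LinearOperator((n, n), matvec=lambda x: S @ x + U @ (D @ (V.T @ x)))

    x = np.random.randn(n)
    y = A.matvec(A.matvec(x))   # A^2 x by repeated application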

                                                                                                  Outline
                                                                                                  • Review, extend communication lower bounds
                                                                                                  • Direct Linear Algebra Algorithms
                                                                                                    – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
                                                                                                    – LU & QR (tournament pivoting)
                                                                                                    – Sparse matrices
                                                                                                    – Eigenproblems (symmetric and nonsymmetric)
                                                                                                  • Iterative Linear Algebra
                                                                                                    – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
                                                                                                    – Reorganizing Krylov methods – Conjugate Gradients
                                                                                                    – Stability challenges and approaches
                                                                                                    – What is a "sparse matrix"?
                                                                                                  • Floating-point reproducibility
                                                                                                    – Despite nondeterminism/nonassociativity

                                                                                                  101

                                                                                                  Reproducible Floating Point Computation

                                                                                                  • Get bit-wise identical answer when you type a.out again
                                                                                                  • NA-Digest submission on 8 Sep 2010
                                                                                                    – From Kai Diethelm, at GNS-MBH
                                                                                                    – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
                                                                                                    – Willing to sacrifice 40–50% of performance for it
                                                                                                  • Email to ~110 Berkeley CSE faculty, asking about it
                                                                                                    – Most: "What?! How will I debug without reproducibility?"
                                                                                                    – Few: "I know better, and do careful error analysis"
                                                                                                    – S. Govindjee: needs it for fracture simulations
                                                                                                    – S. Russell: needs it for nuclear blast detection

                                                                                                  Intel MKL non-reproducibility

                                                                                                  [Figure, two panels: "Absolute Error for Random Vectors" (same magnitude, opposite signs) and "Relative Error for Orthogonal Vectors" (sign not reproducible)]

                                                                                                  Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
                                                                                                  • Dot products are computed using 1, 2, 3, or 4 threads
                                                                                                  • Absolute error = maximum – minimum
                                                                                                  • Relative error = absolute error / maximum absolute value

                                                                                                  103

                                                                                                  Goals/Approaches for Reproducibility

                                                                                                  • Consider summation or dot product
                                                                                                  • Goals:
                                                                                                    1. Same answer, independent of layout, #processors, order of summands
                                                                                                    2. Good performance (scales well)
                                                                                                    3. Portable (assume IEEE 754 only)
                                                                                                    4. User can choose accuracy
                                                                                                  • Approaches:
                                                                                                    – Guarantee fixed reduction tree (fails goal 2 or 3)
                                                                                                    – Use (very) high precision to get exact answer (fails goal 2; see the sketch below)
                                                                                                    – Prerounding technique (Nguyen, D.)

                                                                                                  104
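To illustrate the "exact answer" approach in miniature: Python's math.fsum returns the correctly rounded exact sum, so its result is independent of summand order, unlike naive left-to-right summation:

    import math, random

    xs = [random.uniform(-1, 1) * 10 ** random.randint(0, 12) for _ in range(100_000)]
    ys = xs[:]
    random.shuffle(ys)

    print(sum(xs) == sum(ys))              # often False: rounding depends on order
    print(math.fsum(xs) == math.fsum(ys))  # True: exactly rounded, order-independent

The catch is goal 2: maintaining the exact sum costs extra arithmetic and does not scale the way an ordinary reduction does, which is what motivates the prerounding technique.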

                                                                                                  Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

                                                                                                  Collaborators and Supporters

                                                                                                  • James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
                                                                                                  • Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
                                                                                                  • Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
                                                                                                  • Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
                                                                                                  • Sivan Toledo, Alex Druinsky, Inon Peled
                                                                                                  • Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
                                                                                                  • Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
                                                                                                  • Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
                                                                                                  • bebop.cs.berkeley.edu

                                                                                                  Summary

                                                                                                  Don't Communic…

                                                                                                  106

                                                                                                  Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths, using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies OK; 2.5D works, just over a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

(here ⊗ accumulates into its left-hand side, matching the definition of D = A⊗B above)
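A serial sketch of the two kernels above in NumPy (ours; the parallel 2.5D data layout is omitted, and the broadcast in minplus uses cubic memory, so it is for small examples only):

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): matmul over the (min,+) semiring
        return np.minimum(D, (A[:, None, :] + B.T[None, :, :]).min(axis=2))

    def dc_apsp(D):
        # Kleene's divide-and-conquer APSP; D holds edge weights (np.inf = no edge,
        # 0 on the diagonal) and is overwritten with all-pairs shortest distances.
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12 = D[:m, :m], D[:m, m:]   # views into D, updated in place
        D21, D22 = D[m:, :m], D[m:, m:]
        dc_apsp(D11)
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    # Example: a 4-node directed graph
    D = np.array([[0, 3, np.inf, 7], [8, 0, 2, np.inf],
                  [5, np.inf, 0, 1], [2, np.inf, np.inf, 0]], dtype=float)
    dc_apsp(D)   # D now holds all-pairs shortest path lengths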

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores), with annotated speedups of 6.2x and 2x.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³ / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is independent of the sparsity pattern (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd (sketch below)
• Communication-Avoiding approach
  – A → QAQᵀ = B, where B = Bᵀ is banded, of bandwidth ≈ M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
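To make the baseline concrete, here is a minimal NumPy sketch (ours, unblocked) of the conventional dense-to-tridiagonal reduction via Householder similarity transformations; the two-sided updates are matrix-vector (BLAS2) work, which is what the banded CA approach reduces:

    import numpy as np

    def tridiagonalize(A):
        # Reduce symmetric A to tridiagonal T with Q^T A Q = T (textbook version).
        T = A.astype(float).copy()
        n = T.shape[0]
        Q = np.eye(n)
        for k in range(n - 2):
            x = T[k+1:, k]
            v = x.copy()
            v[0] += np.copysign(np.linalg.norm(x), x[0] if x[0] != 0 else 1.0)
            nv = np.linalg.norm(v)
            if nv == 0.0:
                continue
            v /= nv
            # Two-sided update with the reflector H = I - 2 v v^T
            T[k+1:, k:] -= 2.0 * np.outer(v, v @ T[k+1:, k:])
            T[:, k+1:] -= 2.0 * np.outer(T[:, k+1:] @ v, v)
            Q[:, k+1:] -= 2.0 * np.outer(Q[:, k+1:] @ v, v)
        return Q, T   # T is (numerically) tridiagonal

    # Check: A symmetric random; Q, T = tridiagonalize(A); np.allclose(Q.T @ A @ Q, T)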

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence, one frame per step: a symmetric band matrix of bandwidth b is reduced by annihilating c columns at a time with orthogonal transformations Q1, Q1ᵀ, Q2, Q2ᵀ, ..., Q5, Q5ᵀ; each annihilation creates a bulge of d extra diagonals that is chased down the band in numbered steps 1–6. Legend on each frame: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b; block dimensions b+1, d+1, c, and d+c are marked.]

Conventional vs. CA SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

Speedups of Symmetric Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs. MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs. MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs. ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs. ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs. MKL: 1.9x
  – Best sequential speedup vs. ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized rank-revealing QR decomposition
    2. Randomized location to try splitting the spectrum (sketch below)
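A minimal dense sketch (ours) of one splitting step. It uses the classic Newton iteration for the matrix sign function rather than the randomized, inverse-free iteration of [BDD'11], but it shows how a rank-revealing QR of a spectral projector yields the Q above (assumes A has no eigenvalues on the imaginary axis and some on each side):

    import numpy as np
    from scipy.linalg import qr

    def spectral_split(A, iters=40):
        # sign(A) via Newton: S <- (S + S^{-1})/2 (unscaled, so convergence
        # can be slow for stiff spectra; a sketch, not production code).
        S = A.astype(float).copy()
        for _ in range(iters):
            S = 0.5 * (S + np.linalg.inv(S))
        P = 0.5 * (np.eye(A.shape[0]) + S)   # projector for Re(lambda) > 0
        Q, R, _ = qr(P, pivoting=True)       # rank-revealing QR (column pivoting)
        k = int(np.sum(np.abs(np.diag(R)) > 1e-8 * abs(R[0, 0])))
        T = Q.T @ A @ Q                      # block upper triangular: ||T[k:, :k]|| small
        return Q, T, k

    # Recurse on T[:k, :k] and T[k:, k:]; shifting A by sigma*I before the
    # split moves the dividing line to Re(lambda) = Re(sigma).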

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns: Two Levels and Memory Hierarchy models; where a cell lists two citation groups they are Words; Messages.)

| Algorithm | Two Levels (Words; Messages) | Memory Hierarchy (Words; Messages) |
| BLAS-3 | [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.] |
| Cholesky | [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09]; [G'97][AP'00][BDHS'09] |
| Sym. Indefinite | [BBDDDPSTY'13] | [BBDDDPSTY'13] |
| LU | [G'97][T'97][GDX'11][BDLST'13]; [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13]; [BDLST'13] |
| QR | [EG'98][FW'03][DGHL'12][BDLST'13]; [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13]; [FW'03][BDLST'13] |
| Rank-Revealing QR | [BDD'11][DGGX'13] | |
| Sym. Eig & SVD | [BDD'11][BDK'13] | [BDD'11] |
| Non-Sym. Eig | [BDD'11] | [BDD'11] |

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

| Algorithm | Words (BW) and Messages (L) | Saving factor |
| BLAS-3 | [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2) |
| Cholesky | [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2) |
| Sym. Indefinite | [BBDDDPSTY'13][ScaLAPACK]; [BBDDDPSTY'13] | L: n/P^(1/2) |
| LU | [ScaLAPACK][GDX'11][T'99][SD'11]; [GDX'11][T'99][SD'11] | L: n/P^(1/2) |
| QR | [ScaLAPACK][DGHL'12][T'99]; [DGHL'12][T'99] | L: n/P^(1/2) |
| Rank-Revealing QR | [BDD'11][DGGX'13] | |
| Sym. Eig & SVD | [BDD'11][BDK'13][ScaLAPACK]; [BDD'11][BDK'13] | L: n/P^(1/2) |
| Non-Sym. Eig | [BDD'11]; [BDD'11] | BW: P^(1/2), L: n |

Attaining the bounds with extra memory (2.5D): M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector (see the sketch after this list)
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
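A serial sketch (ours) of the matrix powers kernel idea for the simplest case, a tridiagonal A (1D 3-point stencil, A = tridiag(-1, 2, -1)): each of nblocks "processors" fetches k layers of ghost values once, then computes its pieces of x, Ax, ..., A^k x with no further communication, at the cost of some redundant flops:

    import numpy as np

    def matrix_powers_1d(x, k, nblocks):
        n = len(x)
        V = np.zeros((k + 1, n))
        V[0] = x
        bounds = np.linspace(0, n, nblocks + 1).astype(int)
        for b in range(nblocks):
            lo, hi = bounds[b], bounds[b + 1]
            glo, ghi = max(lo - k, 0), min(hi + k, n)   # one-time fetch of k ghost layers
            w = x[glo:ghi].copy()
            for j in range(1, k + 1):
                # Local stencil apply; the valid interior shrinks by one point
                # per step, which is why k-deep ghost zones suffice for k steps.
                wn = np.zeros_like(w)
                wn[1:-1] = 2*w[1:-1] - w[:-2] - w[2:]
                if glo == 0:                 # physical left boundary row is exact
                    wn[0] = 2*w[0] - w[1]
                if ghi == n:                 # physical right boundary row is exact
                    wn[-1] = 2*w[-1] - w[-2]
                w = wn
                V[j, lo:hi] = w[lo - glo : hi - glo]
        return V   # rows: x, Ax, A^2 x, ..., A^k x

    # Check against the explicit operator:
    # A = np.diag(2*np.ones(n)) - np.diag(np.ones(n-1), 1) - np.diag(np.ones(n-1), -1)
    # and compare rows of matrix_powers_1d(x, k, 4) with x, A@x, A@(A@x), ...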

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning (continued)

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Figure: SpMV performance (Mflops) across register block sizes; the reference kernel is far slower than the best block size, 4x2, which only search finds.]

Register Profile: Itanium 2

[Figure: heat map of Mflops over all register block sizes, ranging from 190 Mflops (worst) to 1190 Mflops (best).]

Register Profiles: IBM and Intel IA-64

[Figure, four register-profile panels: Power3 (122 to 252 Mflops), Power4 (459 to 820 Mflops), Itanium 1 (107 to 247 Mflops), Itanium 2 (190 Mflops to 1.2 Gflops).]

Another Example of Tuning Challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16614
• NNZ = 1.1 M

Zoom in to Top Corner

• More complicated nonzero structure in general
• N = 16614
• NNZ = 1.1 M

3x3 Blocks Look Natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies (sketch below)
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher
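A small NumPy sketch (ours) of the mechanism: store r x c blocks, padding with explicit zeros, so the inner loop becomes a dense r x c multiply that an optimized implementation would fully unroll. A fill ratio of 1.5 means 1.5x more flops, which pays off when the blocked Mflop rate is more than 1.5x higher. Assumes the matrix dimensions divide evenly by r and c:

    import numpy as np

    def to_bcsr(A, r, c):
        # Dense array standing in for CSR input; blocks with any nonzero are
        # kept whole, so zeros inside a kept block are stored explicitly.
        m, n = A.shape
        brow_ptr, bcol_idx, blocks = [0], [], []
        for bi in range(0, m, r):
            for bj in range(0, n, c):
                blk = A[bi:bi+r, bj:bj+c]
                if np.any(blk != 0):
                    bcol_idx.append(bj)
                    blocks.append(blk.copy())
            brow_ptr.append(len(bcol_idx))
        return brow_ptr, bcol_idx, np.array(blocks)

    def bcsr_spmv(brow_ptr, bcol_idx, blocks, x, r, c):
        # y = A*x, one dense r x c multiply per stored block; in C this inner
        # multiply would be unrolled, reusing registers across the block.
        y = np.zeros((len(brow_ptr) - 1) * r)
        for i in range(len(brow_ptr) - 1):
            for k in range(brow_ptr[i], brow_ptr[i+1]):
                j = bcol_idx[k]
                y[i*r:(i+1)*r] += blocks[k] @ x[j:j+c]
        return y

    # fill_ratio = blocks.size / np.count_nonzero(A)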

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the matrix.]

100x100 Submatrix Along Diagonal

[Figure: spy plot of the 100x100 diagonal submatrix.]

Post-RCM Reordering

[Figure: the same submatrix after reverse Cuthill-McKee reordering.]

Effect of Combined RCM+TSP Reordering

[Figure: nonzero structure before (green + red) and after (green + blue) reordering.]

• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm listing omitted in the transcript. Annotation: the SpMVs and dot products require communication in each iteration.]

Example: CA-Conjugate Gradient

[Algorithm listing omitted in the transcript. Annotations: the s-step Krylov basis is computed via the CA matrix powers kernel; one global reduction computes G; the local computations within the inner loop require no communication.]
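The reason the inner loop is communication-free: after one matrix powers kernel builds the basis V and one reduction forms the Gram matrix G = VᵀV, every inner product needed for the next s iterations reduces to small local operations on coefficient vectors. A NumPy sketch (ours), using a 1D Laplacian as the SPD test matrix:

    import numpy as np

    n, s = 900, 4
    rng = np.random.default_rng(0)
    A = np.diag(2*np.ones(n)) - np.diag(np.ones(n-1), 1) - np.diag(np.ones(n-1), -1)
    p = rng.standard_normal(n)
    r = rng.standard_normal(n)

    # (1) Matrix powers kernel: columns p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r
    cols = [p]
    for _ in range(s):
        cols.append(A @ cols[-1])
    cols.append(r)
    for _ in range(s - 1):
        cols.append(A @ cols[-1])
    V = np.column_stack(cols)

    # (2) One global reduction: G = V^T V, a small (2s+1) x (2s+1) matrix
    G = V.T @ V

    # Inside the s inner steps, iterates live as short coefficient vectors:
    # for x = V a and y = V b, <x, y> = a^T G b needs no communication at all.
    a = rng.standard_normal(2*s + 1)
    b = rng.standard_normal(2*s + 1)
    assert np.isclose((V @ a) @ (V @ b), a @ G @ b)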

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. The accuracy floor is machine precision.]
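The breakdown is easy to reproduce: on the same model problem, the condition number of the monomial basis [x, Ax, ..., A^s x] grows roughly geometrically with s and reaches 1/ε near s = 16 (a sketch, ours; Newton or Chebyshev bases are the standard fix):

    import numpy as np

    m = 30                                   # 30x30 grid: n = 900, cond(A) ~ 400
    T = np.diag(2*np.ones(m)) - np.diag(np.ones(m-1), 1) - np.diag(np.ones(m-1), -1)
    A = np.kron(T, np.eye(m)) + np.kron(np.eye(m), T)   # 2D Poisson, 5-point stencil

    v = np.ones(A.shape[0])
    cols = [v]
    for s in range(1, 17):
        v = A @ v
        cols.append(v)
        print(s, np.linalg.cond(np.column_stack(cols)))
    # cond of the basis approaches ~1e16 (1/eps) by s = 16, i.e. the basis
    # is numerically rank deficient, and CA-CG with this basis breaks down.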

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices (sketch below)

|                            | Indices explicit (O(nnz)) | Indices implicit (o(nnz)) |
| Entries explicit (O(nnz))  | CSR and variations        | Vision, climate, AMR, …   |
| Entries implicit (o(nnz))  | Graph Laplacian           | Stencils                  |
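To see why this class is closed under powers, here is a NumPy/SciPy sketch (ours) of A = S + U·D·Vᵀ: applying A costs O(nnz + nk), and A² is again S² plus a low-rank term (now 2k wide), so Aᵏ keeps the "sparse + low rank" form:

    import numpy as np
    import scipy.sparse as sp

    n, k = 2000, 5
    rng = np.random.default_rng(0)
    S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
    U = rng.standard_normal((n, k))
    D = np.diag(rng.standard_normal(k))
    V = rng.standard_normal((n, k))

    def apply_A(x):                    # A x = S x + U D V^T x, never forming A
        return S @ x + U @ (D @ (V.T @ x))

    # A^2 = S^2 + [S U, U] C [V, S^T V]^T, with C = [[D, 0], [D V^T U D, D]]
    U2 = np.column_stack([S @ U, U])
    V2 = np.column_stack([V, S.T @ V])
    C = np.block([[D, np.zeros((k, k))],
                  [D @ (V.T @ U) @ D, D]])
    S2 = S @ S

    x = rng.standard_normal(n)
    assert np.allclose(apply_A(apply_A(x)), S2 @ x + U2 @ (C @ (V2.T @ x)))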

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                    101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: Absolute Error for Random Vectors (same magnitude, opposite signs); Relative Error for Orthogonal Vectors (sign not reproducible).]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
(A two-line demonstration of the underlying non-associativity follows.)

                                                                                                    103
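The root cause, in two lines (our example, standard IEEE 754 double precision): floating-point addition is not associative, so different reduction orders, e.g. 1 vs. 4 threads, legitimately produce different bits.

```python
# Floating-point addition is not associative:
a, b, c = 1.0, 1e-16, -1.0
print((a + b) + c)   # 0.0: b is absorbed when added to a first
print(a + (b + c))   # ~1.1e-16: b survives in the small partial sum
```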

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches (a toy sketch of pre-rounding follows):
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

                                                                                                    104
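A toy sketch of the pre-rounding idea (ours; it is not the actual ReproBLAS/Nguyen-Demmel algorithm, and `bits` and the single max-reduction are simplifying assumptions): summands are first snapped to a shared power-of-two grid chosen from max|x|, so every subsequent addition is exact and therefore order-independent.

```python
import math

def prerounded_sum(xs, bits=30):
    """Pre-rounding: snap every summand to one power-of-two grid chosen
    from max|x|; all later additions are then exact, so any summation
    order (or reduction tree) returns bit-identical results.
    Needs bits + log2(len(xs)) <= 53 so partial sums stay exact."""
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    grid = 2.0 ** (math.floor(math.log2(m)) - bits)
    total = 0.0
    for x in xs:                  # any order gives identical bits
        total += round(x / grid) * grid
    return total

data = [0.1 * i * (-1) ** i for i in range(1000)]
assert prerounded_sum(data) == prerounded_sum(data[::-1])
```

The accuracy knob is `bits`: a coarser grid loses more accuracy but tolerates longer sums, matching goal 4 (user-chosen accuracy) at a modest cost to goal 2.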

Performance results on 1024-proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                    Summary

Don't Communic…

                                                                                                    106

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers)


Performance of 2.5D APSP using Kleene

                                                                                                      53

[Strong-scaling plot on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup. A toy sequential version of the Kleene recursion follows.]
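For reference, a toy sequential version of the Kleene-style recursion that the benchmark above parallelizes in 2.5D (our sketch; the (min,+) products below are done naively and are memory-hungry, unlike the real implementation):

```python
import numpy as np

def minplus(X, Y):
    # (min,+) "matrix multiply": Z[i,j] = min_k X[i,k] + Y[k,j].
    # Naive O(n^3) time and temporary memory; fine for a toy.
    return (X[:, :, None] + Y[None, :, :]).min(axis=1)

def kleene(D):
    """Recursive Kleene APSP over the (min,+) semiring.
    D: square matrix of edge weights, np.inf where no edge, 0 diagonal."""
    n = D.shape[0]
    if n <= 32:                       # base case: Floyd-Warshall
        D = D.copy()
        for k in range(n):
            D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
        return D
    m = n // 2
    A, B = D[:m, :m], D[:m, m:]
    C, E = D[m:, :m], D[m:, m:]
    A = kleene(A)                     # close the top-left block
    B = minplus(A, B)
    C = minplus(C, A)
    E = kleene(np.minimum(E, minplus(C, B)))
    B = minplus(B, E)
    C = minplus(E, C)
    A = np.minimum(A, minplus(B, C))
    return np.block([[A, B], [C, E]])

# Agreement with plain Floyd-Warshall on a random sparse digraph:
rng = np.random.default_rng(0)
n = 64
D = rng.uniform(1, 10, (n, n))
D[rng.random((n, n)) < 0.7] = np.inf
np.fill_diagonal(D, 0.0)
ref = D.copy()
for k in range(n):
    ref = np.minimum(ref, ref[:, k:k+1] + ref[k:k+1, :])
assert np.allclose(kleene(D), ref)
```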

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
(A worked instantiation of the bound for the 2D grid follows.)

                                                                                                      54
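A worked instantiation of the two theorems above for the 2D model problem (our arithmetic): an n x n mesh has N = n^2 unknowns and separator width w = Θ(n), so the dense Cholesky on the top separator alone already forces

```latex
\mathrm{Words\_moved}
  \;=\; \Omega\!\left(\frac{w^{3}}{M^{1/2}}\right)
  \;=\; \Omega\!\left(\frac{n^{3}}{M^{1/2}}\right)
  \;=\; \Omega\!\left(\frac{N^{3/2}}{M^{1/2}}\right)
```

i.e., the dense lower bound applied to the largest frontal matrix.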

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one:
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):
  Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – Columns of Q·U are the eigenvectors; Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q A Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea (see the Successive Band Reduction slides below; a small SciPy check of the banded stage follows)
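A small SciPy check (ours) of the back end of this pipeline: once a symmetric banded B with A's spectrum is in hand, the remaining Banded → Tridiagonal → Diagonal stages are standard library territory. The banded matrix below is fabricated by truncation purely to have a test input; it is not the TSQR-based Dense → Banded reduction.

```python
import numpy as np
from scipy.linalg import eig_banded, eigh

rng = np.random.default_rng(0)
n, b = 300, 5
A = rng.standard_normal((n, n)); A = A + A.T
B = np.triu(np.tril(A, b), -b)        # fabricated symmetric band matrix
ab = np.zeros((b + 1, n))             # LAPACK-style lower band storage:
for k in range(b + 1):                # ab[k, j] = B[j + k, j]
    ab[k, : n - k] = np.diagonal(B, -k)
w_banded = eig_banded(ab, lower=True, eigvals_only=True)
w_dense = eigh(B, eigvals_only=True)
assert np.allclose(w_banded, w_dense)
```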

Successive Band Reduction (Bischof/Lang/Sun)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Sequence of animation slides: starting from a symmetric band matrix of width b+1, blocks of c columns are annihilated one at a time; each annihilation creates a bulge of d+c diagonals that is chased down the band by orthogonal updates Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T (sweeps numbered 1–6).]

Conventional vs. CA-SBR

Conventional: touch all data 4 times | Communication-Avoiding: touch all data once
[Two embedded animations comparing the bulge-chasing schemes.]

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs. MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs. MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs. ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs. ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs. MKL: 1.9x
  – Best sequential speedup vs. ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        Q^T A Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
(A toy sign-function sketch of one splitting step follows.)
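A toy, dense sketch (ours) of one splitting step via the matrix sign function; the slide's CA version replaces the explicit inverses with randomized, QR-based, communication-avoiding steps, but the invariant-subspace splitting it produces is the same. Assumes no eigenvalues on (or very near) the imaginary axis.

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, iters=50):
    """One spectral divide-and-conquer step via the matrix sign function.
    Newton's iteration X <- (X + X^{-1})/2 converges to sign(A) provided
    A has no eigenvalues on the imaginary axis (assumed here)."""
    X = A.copy()
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))
    P = 0.5 * (np.eye(A.shape[0]) + X)   # projector onto the Re(lambda)>0
    Q, _, _ = qr(P, pivoting=True)       # invariant subspace; RRQR-style
    T = Q.T @ A @ Q                      # block upper triangular (up to
    return Q, T                          # roundoff): T = [A11 A12; ~0 A22]

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
Q, T = split_spectrum(A)
k = int(np.sum(np.linalg.eigvals(A).real > 0))   # subspace dimension
print(np.linalg.norm(T[k:, :k]))                 # ~1e-13: the "ε" block
```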

Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of Memory (#Words, #Messages) | Memory Hierarchy (#Words, #Messages); citation groups appear in that order.

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words (BW), #Messages (L), Saving factor.

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saves L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saves L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13]; saves L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11]; saves L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99]; saves L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13]; saves L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] | [BDD'11]; saves BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = O(c·n^2/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                      75
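To make these counts concrete, below is a minimal numpy/scipy sketch of the naive kernel that CA-Krylov methods reorganize: computing the basis [x, Ax, A²x, …, Aᵏx] with k separate SpMVs, i.e., k communication rounds in a conventional parallel run. The toy tridiagonal matrix and function name are illustrative assumptions; a real CA matrix powers kernel would instead fetch each partition's k-level halo once and compute all k vectors locally.

```python
import numpy as np
import scipy.sparse as sp

def krylov_basis(A, x, k):
    """Naive matrix powers kernel: columns are [x, Ax, ..., A^k x].
    Each A @ v is one SpMV; conventionally that is one round of
    neighbor communication, so this loop communicates k times."""
    V = np.empty((A.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

# Toy example: 1D Poisson (tridiagonal) matrix, k = 4
n, k = 100, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
V = krylov_basis(A, np.ones(n), k)
```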

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                      77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                      78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profile in Mflop/s; the Reference (unblocked) point and the Best point (4x2 blocking) are marked.]

                                                                                                      79

Register Profile: Itanium 2

[Figure: heat map over register block sizes, ranging from 190 Mflop/s (reference) to 1190 Mflop/s (best).]

                                                                                                      80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile heat maps, best fraction of machine peak in parentheses. Power3 (17%): 122–252 Mflop/s. Power4 (16%): 459–820 Mflop/s. Itanium 1 (8%): 107–247 Mflop/s. Itanium 2 (33%): 190 Mflop/s – 1.2 Gflop/s.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                      82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                      83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                      84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher

                                                                                                      85
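A minimal scipy sketch of the same trade-off (illustrative only; the slide's numbers come from hand-tuned register-blocked C code, not scipy): converting CSR to 3x3 Block Sparse Row pads partially filled blocks with explicit zeros, and the fill ratio measures the resulting extra work.

```python
import numpy as np
import scipy.sparse as sp

# Random sparse matrix standing in for a matrix with block substructure
A_csr = sp.random(900, 900, density=0.01, format="csr", random_state=0)

# BSR with 3x3 blocks: scipy fills partial blocks with explicit zeros,
# trading extra flops for one index per block instead of one per entry.
A_bsr = A_csr.tobsr(blocksize=(3, 3))

# Fill ratio = stored entries (including explicit zeros) / true nonzeros
print("fill ratio =", A_bsr.nnz / A_csr.nnz)

x = np.ones(900)
y = A_bsr @ x   # SpMV on the blocked format
```

On a matrix with genuine 3x3 substructure the fill ratio stays near 1 and the blocked SpMV wins; on this random matrix it would be large, which is exactly why search is needed.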

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                                                      86

                                                                                                      100x100 Submatrix Along Diagonal

87

                                                                                                      Post-RCM Reordering

                                                                                                      88

Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …
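A small sketch of the RCM half of this reordering using scipy's stock routine (the TSP-based refinement has no standard library counterpart, so it is omitted; the random symmetric matrix is a stand-in for the cavity-design problem):

```python
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Random symmetric sparse matrix as a stand-in test problem
A = sp.random(2000, 2000, density=0.002, format="csr", random_state=4)
A = (A + A.T).tocsr()

# RCM clusters nonzeros near the diagonal, creating the dense
# substructure that register/cache blocking can then exploit.
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm, :][:, perm]

def bandwidth(M):
    """Max |i - j| over nonzeros: a crude measure of clustering."""
    coo = M.tocoo()
    return int(abs(coo.row - coo.col).max())

print(bandwidth(A), "->", bandwidth(A_rcm))
```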

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                                                                                      90
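As one concrete instance from the list above, the "multiple vectors (SpMM)" optimization amortizes each read of A across all vectors. A minimal sketch of the interface difference (machine-dependent speedups are not claimed here):

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(10000, 10000, density=0.001, format="csr", random_state=5)
X = np.random.default_rng(5).standard_normal((10000, 8))

# 8 separate SpMVs: A's entries are streamed from memory 8 times
Y_loop = np.column_stack([A @ X[:, j] for j in range(X.shape[1])])

# One SpMM: each stored entry of A is read once, applied to all 8 vectors
Y_spmm = A @ X

assert np.allclose(Y_loop, Y_spmm)
```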

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                      91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                      93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure. The SpMV and the dot products in each iteration require communication.]

94

Example: CA-Conjugate Gradient

[Algorithm figure. The s SpMVs are computed via the CA matrix powers kernel, and the dot products are replaced by one global reduction to compute the Gram matrix G; local computations within the inner loop require no communication.]
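For reference, a minimal numpy sketch of the classical algorithm in the first figure (not the CA variant), with the per-iteration communication points marked in comments; the 2D Poisson test matrix matches the model problem two slides ahead:

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    """Classical CG for SPD A: one SpMV and two dot products per
    iteration -- each is a communication event in a parallel run."""
    x = x0.copy()
    r = b - A @ x              # SpMV
    p = r.copy()
    rs = r @ r                 # dot product (global reduction)
    for _ in range(maxiter):
        Ap = A @ p             # SpMV
        alpha = rs / (p @ Ap)  # dot product (global reduction)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r         # dot product (global reduction)
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 2D Poisson (5-point stencil) on a 30x30 grid
n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()
x = cg(A, np.ones(n * n), np.zeros(n * n))
```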

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                      96

[Convergence plot: CA-CG (monomial basis) vs CG. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff relative to the machine-precision level; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
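A quick experiment (an illustrative sketch on the same assumed model problem) shows the mechanism: the condition number of the monomial basis [p, Ap, A²p, …] grows geometrically in s, so well before s = 16 the basis columns are numerically linearly dependent:

```python
import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400)
n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()

v = np.random.default_rng(0).standard_normal(n * n)
for s in (2, 4, 8, 16):
    V = np.empty((n * n, s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)  # scaled, but still monomial directions
    # cond(V) heads toward 1/eps: the basis becomes numerically rank deficient
    print(s, np.linalg.cond(V))
```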

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):   CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):   Graph Laplacian             Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices
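A minimal sketch of why the S + UDVᵀ form is useful (sizes and names are illustrative): the dense sum is never formed, and applying the matrix to a vector costs one SpMV plus two tall-skinny multiplies.

```python
import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_low_rank(S, U, D, V, x):
    """y = (S + U D V^T) x without forming the dense n x n sum.
    S is sparse; U, V are n x r with r << n; D is r x r."""
    return S @ x + U @ (D @ (V.T @ x))

n, r = 1000, 3
rng = np.random.default_rng(1)
S = sp.random(n, n, density=0.005, format="csr", random_state=1)
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))

y = apply_sparse_plus_low_rank(S, U, D, V, rng.standard_normal(n))
```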

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                      101

Reproducible Floating Point Computation

• Goal: get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure, two panels: "Absolute Error for Random Vectors" (same magnitude, opposite signs) and "Relative Error for Orthogonal Vectors" (sign not reproducible).]

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

104

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
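A heavily simplified, single-"bin" sketch of the prerounding idea (the actual algorithm of Demmel & Nguyen keeps several bins to preserve accuracy; this toy trades accuracy for brevity). Every summand is first rounded onto a common grid chosen so that all later additions are exact, which makes the result independent of summation order, and hence of layout and thread count.

```python
import math
import numpy as np

def reproducible_sum(x):
    """One-bin prerounding sketch: order-independent float64 sum."""
    x = np.asarray(x, dtype=np.float64)
    M = float(np.max(np.abs(x)))
    if M == 0.0:
        return 0.0
    # Grid spacing chosen so every partial sum of the rounded summands
    # is exactly representable => no rounding error, any summation order.
    shift = 2.0 ** (math.ceil(math.log2(M)) + math.ceil(math.log2(x.size)) + 1)
    leading = (x + shift) - shift   # x rounded to multiples of ulp(shift)
    return float(np.sum(leading))   # every addition here is exact

vals = np.random.default_rng(2).standard_normal(10**6)
assert reproducible_sum(vals) == reproducible_sum(vals[::-1])  # bit-identical
```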

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

                                                                                                        54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; new one?
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55
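A small experiment supporting the "reuse unlikely" step (a sketch; scipy's sp.random is only a stand-in for a true Erdos-Renyi generator): with about d nonzeros per row, nearly all of the roughly d²·n scalar products in C = A·B land in distinct entries, so entries of C are rarely reused.

```python
import scipy.sparse as sp

n, d = 10000, 4
density = d / n   # Prob(entry != 0) = d/n, with d << n^(1/2)
A = sp.random(n, n, density=density, format="csr", random_state=6)
B = sp.random(n, n, density=density, format="csr", random_state=7)

C = A @ B
# nnz(C) is close to d^2 * n: collisions (reuse of C entries) are rare
print(C.nnz, "vs d^2 * n =", d * d * n)
```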

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ's entries eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea (see next figure)
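As a small illustration of the banded intermediate representation (a sketch of the last pipeline stage only, using scipy's packed band storage; it does not implement the CA dense-to-banded reduction):

```python
import numpy as np
from scipy.linalg import eig_banded, eigh

# Symmetric banded matrix with bandwidth u = 2 in packed upper form:
# a_band[u + i - j, j] = A[i, j]; row u holds the main diagonal.
n, u = 8, 2
rng = np.random.default_rng(3)
a_band = rng.standard_normal((u + 1, n))
a_band[u] += 4.0   # strengthen the diagonal

# Banded -> eigenvalues/eigenvectors: the tail of the CA pipeline
w, v = eig_banded(a_band, lower=False)

# Cross-check against the dense solver on the unpacked matrix
A = np.diag(a_band[u])
for k in range(1, u + 1):
    A += np.diag(a_band[u - k, k:], k) + np.diag(a_band[u - k, k:], -k)
np.testing.assert_allclose(w, np.sort(eigh(A)[0]), rtol=1e-10, atol=1e-10)
```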

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: bulge-chasing sweeps 1 through 5. Each sweep applies an orthogonal transform Q1, Q2, Q3, Q4 (and its transpose) to annihilate a c-column block of the band and chase the resulting (d+c)-wide bulge down the matrix; the frames are annotated with the block dimensions b+1, d+1, c, and d+c.]

b = bandwidth
c = #columns
d = #diagonals
Constraint: c + d ≤ b

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                        Successive Band Reduction (BischofLangSun)

                                                                                                        1

                                                                                                        1

                                                                                                        2

                                                                                                        2

                                                                                                        3

                                                                                                        3

                                                                                                        4

                                                                                                        4

                                                                                                        5

                                                                                                        5

                                                                                                        Q5T

                                                                                                        Q1

                                                                                                        Q1T

                                                                                                        Q2

                                                                                                        Q2T

                                                                                                        Q3

                                                                                                        Q3T

                                                                                                        Q5

                                                                                                        Q4

                                                                                                        Q4T

                                                                                                        b+1

                                                                                                        b+1

                                                                                                        d+1

                                                                                                        d+1

                                                                                                        c

                                                                                                        c

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                        Successive Band Reduction (BischofLangSun)

                                                                                                        1

                                                                                                        1

                                                                                                        2

                                                                                                        2

                                                                                                        3

                                                                                                        3

                                                                                                        4

                                                                                                        4

                                                                                                        5

                                                                                                        5

                                                                                                        6

                                                                                                        6

                                                                                                        Q5T

                                                                                                        Q1

                                                                                                        Q1T

                                                                                                        Q2

                                                                                                        Q2T

                                                                                                        Q3

                                                                                                        Q3T

                                                                                                        Q5

                                                                                                        Q4

                                                                                                        Q4T

                                                                                                        b+1

                                                                                                        b+1

                                                                                                        d+1

                                                                                                        d+1

                                                                                                        c

                                                                                                        c

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        d+c

                                                                                                        b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                        Successive Band Reduction (BischofLangSun)

Conventional vs CA – SBR

| Conventional           | Communication-Avoiding |
| Touch all data 4 times | Touch all data once    |

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours vs ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer (sketched below)
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
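The divide step can be sketched in a few lines of dense numpy. This is an illustration only, not the communication-avoiding algorithm of [BDD'11]: it uses an explicit Newton iteration for the matrix sign function (which inverts X, something the inverse-free CA iteration avoids) and a Gaussian random matrix as the randomized rank revealer. The function name and signature are mine, and A is assumed to have no eigenvalues on the splitting line.

```python
import numpy as np

def spectral_divide(A, shift=0.0, tol=1e-12, maxit=100):
    """One divide step: split the spectrum of A at the line Re(z) = shift.
    Assumes A (real, square) has no eigenvalues on that line."""
    n = A.shape[0]
    X = A - shift * np.eye(n)
    for _ in range(maxit):                # Newton iteration for sign(X)
        X_new = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(X_new - X, 1) <= tol * np.linalg.norm(X_new, 1):
            X = X_new
            break
        X = X_new
    P = 0.5 * (np.eye(n) - X)             # projector: eigenvalues left of the line
    k = int(round(np.trace(P)))           # dimension of that invariant subspace
    # Randomized range finder: leading k columns of Q span range(P) w.p. 1
    Q, _ = np.linalg.qr(P @ np.random.randn(n, n))
    T = Q.T @ A @ Q                       # block upper triangular: ||T[k:, :k]|| = O(eps)
    return Q, T, k
```

Applying spectral_divide recursively to the diagonal blocks of T yields a Schur-like form.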

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

| Algorithm | Two Levels: #Words | Two Levels: #Messages | Memory Hierarchy: #Words | Memory Hierarchy: #Messages |
| BLAS-3 | [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.] |
| Cholesky | [G'97] [AP'00] [LAPACK] [BDHS'09] | [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09] |
| Sym. Indefinite | [BBDDDPSTY'13] | [BBDDDPSTY'13] | [BBDDDPSTY'13] | [BBDDDPSTY'13] |
| LU | [G'97] [T'97] [GDX'11] [BDLST'13] | [GDX'11] [BDLST'13] | [G'97] [T'97] [BDLST'13] | [BDLST'13] |
| QR | [EG'98] [FW'03] [DGHL'12] [BDLST'13] | [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] | [FW'03] [BDLST'13] |
| Rank-Revealing QR | [BDD'11] [DGGX'13] | | | |
| Sym. Eig & SVD | [BDD'11] [BDK'13] | [BDD'11] | | |
| Non-Sym. Eig | [BDD'11] | [BDD'11] | | |

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/√P), #messages = Ω(√P))
Legend: [Existing] [Ours] [Math-Lib] [Random]

| Algorithm | #Words (BW) | #Messages (L) | Saving factor |
| BLAS-3 | [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11] | [C'69] [vGW'97] [SD'11] | L: n/√P |
| Cholesky | [ScaLAPACK] [T'99] [SD'11] | [T'99] [SD'11] | L: n/√P |
| Sym. Indefinite | [BBDDDPSTY'13] [ScaLAPACK] | [BBDDDPSTY'13] | L: n/√P |
| LU | [ScaLAPACK] [GDX'11] [T'99] [SD'11] | [GDX'11] [T'99] [SD'11] | L: n/√P |
| QR | [ScaLAPACK] [DGHL'12] [T'99] | [DGHL'12] [T'99] | L: n/√P |
| Rank-Revealing QR | [BDD'11] [DGGX'13] | | |
| Sym. Eig & SVD | [BDD'11] [BDK'13] [ScaLAPACK] | [BDD'11] [BDK'13] | L: n/√P |
| Non-Sym. Eig | [BDD'11] | [BDD'11] | BW: √P, L: n |

Saving factors are attained with extra memory: 2.5D, M = O(c·n²/P).

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector (sketched after this slide)
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                        75
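The kernel being reorganized is easy to state. Below is a naive numpy sketch (names mine) of the Krylov basis computation, with its communication cost noted; the CA "matrix powers kernel" computes the same k+1 vectors while reading A only once (serially), or with one neighbor exchange of k-hop ghost-zone rows (in parallel), at the price of some redundant flops.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Compute the Krylov basis [x, Ax, A^2 x, ..., A^k x].

    Naive version: k separate SpMVs, i.e. O(k) reads of A (serial)
    or O(k log p) messages (parallel). The CA matrix powers kernel
    produces the same vectors with O(1) reads / O(log p) messages."""
    V = np.empty((A.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]   # one communication round per step
    return V

# Example: 1D Poisson (tridiagonal) matrix, hypothetical sizes
n, k = 1000, 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
V = matrix_powers(A, np.ones(n), k)
```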

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Plot: SpMV register profile. Reference implementation: 190 Mflops; best block size found by search (4x2): 1190 Mflops.]

79

Register Profile: Itanium 2

[Heat map of SpMV Mflops over register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64

[Four heat maps, best fraction of machine peak in parentheses: Power3 (17%): 122–252 Mflops; Power4 (16%): 459–820 Mflops; Itanium 1 (8%): 107–247 Mflops; Itanium 2 (33%): 190 Mflops – 1.2 Gflops.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup! (see the sketch below)
  – Actual mflop rate 1.5² = 2.25x higher

                                                                                                        85
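The fill-in trade-off can be reproduced with SciPy's block-CSR (BSR) format; a random matrix stands in for the raefsky structure here, and the 1.5x figure above is a Pentium III measurement, not something this sketch claims to reproduce.

```python
import numpy as np
import scipy.sparse as sp

# Store a CSR matrix in BSR (block CSR) with 3x3 blocks, filling in
# explicit zeros wherever a block is only partially nonzero.
A = sp.random(300, 300, density=0.02, format='csr', random_state=0)
A_bsr = A.tobsr(blocksize=(3, 3))

# Fill ratio = stored values (including explicit zeros) / true nonzeros.
fill_ratio = A_bsr.nnz / A.nnz
print(f"fill ratio = {fill_ratio:.2f}")

# Both formats compute the same y = A @ x; the blocked version can still
# win on real hardware because each 3x3 block multiply is unrolled and
# needs one column index per block instead of one per nonzero.
x = np.ones(300)
assert np.allclose(A @ x, A_bsr @ x)
```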

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning (see the sketch below)
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                        91
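The run-time half of that tuning can be sketched as a brute-force timing search over block sizes. This is not the OSKI C API, just an illustration of "the need for search"; function and parameter names are mine.

```python
import time
import numpy as np
import scipy.sparse as sp

def tune_blocksize(A, candidates=((1, 1), (2, 2), (3, 3), (4, 4), (8, 8)),
                   trials=3):
    """Pick the SpMV register block size that actually runs fastest on
    this machine, for this matrix, by timing each candidate."""
    x = np.ones(A.shape[1])
    best, best_t = (1, 1), float('inf')
    for r, c in candidates:
        if A.shape[0] % r or A.shape[1] % c:
            continue                      # blocksize must divide the shape
        Ab = A.tobsr(blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            Ab @ x
        t = (time.perf_counter() - t0) / trials
        if t < best_t:
            best, best_t = (r, c), t
    return best

A = sp.random(600, 600, density=0.02, format='csr', random_state=1)
print(tune_blocksize(A))   # the winner depends on the machine and matrix
```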

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                        93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration; see the sketch below.
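For reference, a minimal numpy sketch of the classical iteration (the function name and signature are mine), with the communication-bearing operations marked:

```python
import numpy as np

def cg(A, b, tol=1e-8, maxit=500):
    """Classical CG: per iteration one SpMV (neighbor communication)
    and two dot products (global reductions)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r                        # global reduction
    bnorm = np.sqrt(b @ b)
    for _ in range(maxit):
        Ap = A @ p                    # SpMV: communicate with neighbors
        alpha = rr / (p @ Ap)         # global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                # global reduction
        if np.sqrt(rr_new) <= tol * bnorm:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```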

Example: CA-Conjugate Gradient (sketched below)

• The s-step bases are computed via the CA Matrix Powers Kernel
• One global reduction computes the Gram matrix G
• Local computations within the inner loop require no communication

94
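A dense sketch of the s-step reorganization with the monomial basis (variable names mine). Per outer iteration there is one basis computation (plain SpMVs here; the CA version uses the matrix powers kernel) and one global reduction for G = VᵀV; the s inner updates touch only length-(2s+1) coefficient vectors.

```python
import numpy as np

def ca_cg(A, b, s=4, outer=100, tol=1e-8):
    """CA-CG sketch, monomial basis: equivalent to s steps of classical
    CG per outer iteration, in exact arithmetic."""
    n = b.size
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    bnorm = np.linalg.norm(b)
    k = 2 * s + 1
    # B represents multiplication by A in basis coordinates (shift matrix)
    B = np.zeros((k, k))
    for j in range(s):                # A * (A^j p) = A^(j+1) p
        B[j + 1, j] = 1.0
    for j in range(s - 1):            # A * (A^j r) = A^(j+1) r
        B[s + 2 + j, s + 1 + j] = 1.0
    for _ in range(outer):
        # Basis V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]
        V = np.empty((n, k))
        V[:, 0] = p
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        V[:, s + 1] = r
        for j in range(s - 1):
            V[:, s + 2 + j] = A @ V[:, s + 1 + j]
        G = V.T @ V                   # the single global reduction
        # Coefficient vectors of the x-update, r, and p in the basis V
        xc = np.zeros(k)
        rc = np.zeros(k); rc[s + 1] = 1.0
        pc = np.zeros(k); pc[0] = 1.0
        for _ in range(s):            # inner steps: local work only
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc += alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x += V @ xc                   # map back to length-n vectors
        r = V @ rc
        p = V @ pc
        if np.linalg.norm(r) <= tol * bnorm:
            break
    return x
```

In exact arithmetic this reproduces s steps of classical CG; in floating point the monomial basis is the weak point, as the next slide shows.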

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                        96

[Plot: CA-CG (monomial basis) vs CG on a model problem – 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy, relative to machine precision, due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Reproduced in the sketch below.]

                                                                                                        97
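The breakdown is easy to reproduce: build the model problem and watch the condition number of the (scaled but not orthogonalized) monomial basis climb toward 1/ε. A small sketch, assuming the same 30x30 Poisson setup as the plot:

```python
import numpy as np
import scipy.sparse as sp

# Model problem: 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400)
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
I = sp.identity(m)
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()

x = np.ones(A.shape[0])
for s in (2, 4, 8, 16):
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = x / np.linalg.norm(x)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)   # scaled, not orthogonalized
    sv = np.linalg.svd(V, compute_uv=False)
    # cond(V) grows exponentially with s; per the plot above, the basis
    # becomes numerically rank deficient around s = 16.
    print(s, sv[0] / sv[-1])
```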

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

| Nonzero entries \ Indices | Explicit (O(nnz)) | Implicit (o(nnz)) |
| Explicit (O(nnz)) | CSR and variations | Vision, climate, AMR, … |
| Implicit (o(nnz)) | Graph Laplacian | Stencils |

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square (see the sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices
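Exploiting such implicit structure is straightforward for a single matvec; a sketch with hypothetical sizes follows (the hard part, rewriting Aᵏ as Sᵏ plus low-rank terms for the matrix powers kernel, is not shown):

```python
import numpy as np
import scipy.sparse as sp

# y = (S + U D V^T) x without ever forming the dense matrix A:
# O(nnz(S) + n * rank) work instead of O(n^2).
n, rank = 10_000, 5
S = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
rng = np.random.default_rng(0)
U = rng.standard_normal((n, rank))
D = np.diag(rng.standard_normal(rank))
V = rng.standard_normal((n, rank))

def matvec(x):
    return S @ x + U @ (D @ (V.T @ x))   # never materialize U D V^T

y = matvec(rng.standard_normal(n))
```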

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                        101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

[Figure: Intel MKL non-reproducibility. Left: absolute error for random vectors – same magnitude, opposite signs. Right: relative error for orthogonal vectors – the sign itself is not reproducible. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector: dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value.]

                                                                                                        103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.) – sketched below

104

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M.
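A minimal single-bin sketch of the prerounding idea (the production implementation uses several bins to keep near-full accuracy; the function name is mine):

```python
import numpy as np

def reproducible_sum(x):
    """Prerounding sketch (Nguyen, D.): round every summand to a common
    grid so that all subsequent additions are EXACT, hence independent
    of summation order, #threads, and data layout. One bin only, so the
    error is bounded by n * (grid spacing); real implementations use
    several bins to recover accuracy."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Grid chosen so n grid values always sum without rounding:
    M = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 1)
    t = (x + M) - M        # rounds each x[i] to a multiple of ulp(M)
    # Every t[i] and every partial sum is a multiple of ulp(M) with
    # magnitude < M, so each addition is exact -> same bits, any order.
    return float(np.sum(t))

xs = np.random.default_rng(0).standard_normal(10**6)
assert reproducible_sum(xs) == reproducible_sum(xs[::-1].copy())
```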

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                        Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                        • Outline (11)
                                                                                                        • What is a ldquosparse matrixrdquo
                                                                                                        • Outline (12)
                                                                                                        • Reproducible Floating Point Computation
                                                                                                        • Intel MKL non-reproducibility
                                                                                                        • GoalsApproaches for Reproducibility
                                                                                                        • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                        • Collaborators and Supporters
                                                                                                        • Summary

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable, and a new one holds
  – Ex: A, B both diagonal: no communication in the parallel case
  – Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent, i.e. the assignment of data and work to processors does not depend on the sparsity pattern (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      #Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast with the general lower bound: #Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost
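To make the gap between the two bounds concrete, here is a quick numeric comparison in Python; the parameter values d, n, P, M are illustrative assumptions, not values from the slide.

from math import sqrt

# Compare the sparsity-independent bound with the general bound for
# Erdos-Renyi matmul (illustrative parameter values only).
d, n = 10, 10**6          # expected nonzeros per row, matrix dimension
P, M = 1024, 10**7        # processors, fast/local memory size

sparse_bound  = min(d * n / sqrt(P), d**2 * n / P)  # Omega(min(dn/P^(1/2), d^2 n/P))
general_bound = d**2 * n / (P * sqrt(M))            # Omega(d^2 n / (P M^(1/2)))

print(f"sparsity-independent bound: {sparse_bound:.3g} words")
print(f"general lower bound:        {general_bound:.3g} words")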

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Columns of Q·U are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

[Figure sequence: Successive Band Reduction (Bischof/Lang/Sun). Diagrams of a banded symmetric matrix show orthogonal transforms Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T successively eliminating d diagonals from c columns at a time and chasing the resulting bulge down the band. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]
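A minimal numpy sketch of the first stage (Dense → Banded) in the spirit of the slides: QR each panel below the band and apply the orthogonal factor from both sides. A real CA implementation would use TSQR on each panel and block the two-sided update; the function below and its plain np.linalg.qr are illustrative only.

import numpy as np

def sym_to_band(A, b):
    """Reduce symmetric A to bandwidth b: QR the panel below the band
    in each block column, then apply Q from both sides. Zeros created
    in earlier columns are never touched again, so one sweep suffices."""
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n - b - 1, b):
        lo = j + b                            # first row below the band
        Q, R = np.linalg.qr(A[lo:, j:j+b], mode='complete')
        A[lo:, j:j+b] = R                     # exact zeros below the band
        A[j:j+b, lo:] = R.T                   # keep symmetry
        A[lo:, lo:] = Q.T @ A[lo:, lo:] @ Q   # two-sided trailing update
    return A

# sanity check: eigenvalues preserved, result is banded with bandwidth 5
A = np.random.randn(60, 60); A = A + A.T
B = sym_to_band(A, b=5)
assert np.allclose(np.sort(np.linalg.eigvalsh(A)), np.sort(np.linalg.eigvalsh(B)))
assert np.allclose(B, np.triu(np.tril(B, 5), -5))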

Conventional vs CA - SBR

  Conventional:             touch all data 4 times
  Communication-Avoiding:   touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
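A dense numpy/scipy sketch of one divide step. For concreteness it uses the Newton iteration for the matrix sign function and column-pivoted QR, rather than the inverse-free iteration and randomized RRQR the slide refers to; all names and the test matrix are illustrative.

import numpy as np
from scipy.linalg import qr

def matrix_sign(A, iters=100, tol=1e-12):
    """Newton iteration X <- (X + X^-1)/2 for the matrix sign function
    (requires no eigenvalues on the splitting line)."""
    X = A.copy()
    for _ in range(iters):
        X_new = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(X_new - X, 1) <= tol * np.linalg.norm(X_new, 1):
            return X_new
        X = X_new
    return X

def split_spectrum(A, shift=0.0):
    """One divide step: returns (Q, k) with Q^T A Q block upper triangular;
    the leading k columns of Q span the invariant subspace for Re(lambda) < shift."""
    n = A.shape[0]
    S = matrix_sign(A - shift * np.eye(n))
    P = 0.5 * (np.eye(n) - S)           # spectral projector
    k = int(round(np.trace(P)))         # rank = subspace dimension
    Q, _, _ = qr(P, pivoting=True)      # rank-revealing (pivoted) QR
    return Q, k

# test matrix with known, well-separated spectrum
rng = np.random.default_rng(0)
V = rng.standard_normal((6, 6))
A = V @ np.diag([-3., -2., -1., 1., 2., 3.]) @ np.linalg.inv(V)
Q, k = split_spectrum(A)                # k should be 3
B = Q.T @ A @ Q
assert np.linalg.norm(B[k:, :k]) < 1e-6 * np.linalg.norm(A)   # the "ε" block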

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns: Two Levels – #Words, #Messages; Memory Hierarchy – #Words, #Messages)

• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.] (all four columns)
• Cholesky: #words (two levels): [G'97], [AP'00], [LAPACK], [BDHS'09]; remaining columns: [G'97], [AP'00], [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: #words (two levels): [G'97], [T'97], [GDX'11], [BDLST'13]; #messages (two levels): [GDX'11], [BDLST'13]; hierarchy: [G'97], [T'97], [BDLST'13] (#words), [BDLST'13] (#messages)
• QR: #words (two levels): [EG'98], [FW'03], [DGHL'12], [BDLST'13]; #messages (two levels): [FW'03], [DGHL'12], [BDLST'13]; hierarchy: [EG'98], [FW'03], [BDLST'13] (#words), [FW'03], [BDLST'13] (#messages)
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13] (#words); [BDD'11] (#messages)
• Nonsym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11]; saving factor: L × n/P^(1/2)
• Cholesky: [ScaLAPACK], [T'99], [SD'11]; saving factor: L × n/P^(1/2)
• Sym. Indefinite: words: [BBDDDPSTY'13], [ScaLAPACK]; messages: [BBDDDPSTY'13]; saving factor: L × n/P^(1/2)
• LU: words: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; messages: [GDX'11], [T'99], [SD'11]; saving factor: L × n/P^(1/2)
• QR: words: [ScaLAPACK], [DGHL'12], [T'99]; messages: [DGHL'12], [T'99]; saving factor: L × n/P^(1/2)
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: words: [BDD'11], [BDK'13], [ScaLAPACK]; messages: [BDD'11], [BDK'13]; saving factor: L × n/P^(1/2)
• Non-Sym. Eig: words: [BDD'11]; messages: [BDD'11]; saving factor: BW × P^(1/2), L × n

Attaining with extra memory: 2.5D, M = Θ(c·n²/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
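For reference, here is the naive kernel these methods reorganize (names illustrative): computing the Krylov basis with k separate SpMVs costs k passes over A, or k rounds of messages in parallel, whereas the CA matrix powers kernel produces the same vectors in O(1) passes by replicating boundary ("ghost") rows of a well-partitioned matrix.

import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Columns [x, Ax, A^2 x, ..., A^k x] of the Krylov basis, written
    naively as k separate SpMVs: k passes over A in serial, k rounds of
    neighbor messages in parallel. The CA kernel computes the same
    vectors with O(1) passes / rounds."""
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

# example: 1D Poisson matrix, k = 4
n, k = 100, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
V = matrix_powers(A, np.ones(n), k)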

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Figure: Mflop/s over register block sizes, comparing the reference implementation against the best block size, 4x2]

Register Profile: Itanium 2

[Figure: register-blocking profile, ranging from 190 Mflop/s (worst) to 1190 Mflop/s (best)]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles for four machines: Power3 (122 to 252 Mflop/s), Power4 (459 to 820 Mflop/s), Itanium 1 (107 to 247 Mflop/s), Itanium 2 (190 Mflop/s to 1.2 Gflop/s)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

3x3 blocks look natural, but...

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher
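The same fill-in trade-off is easy to reproduce with scipy's blocked-CSR (BSR) format, which pads partially full blocks with explicit zeros. The matrix below is a random stand-in, so its fill ratio will be far worse than the 1.5 above; structured matrices are where blocking pays off.

import scipy.sparse as sp

# Register blocking with explicit fill: store the matrix in dense 3x3
# blocks (BSR), padding partially full blocks with explicit zeros.
A = sp.random(3000, 3000, density=0.003, format='csr', random_state=0)
B = A.tobsr(blocksize=(3, 3))
fill_ratio = B.nnz / A.nnz      # stored entries (incl. explicit zeros) / true nnz
print(f"fill ratio = {fill_ratio:.2f}")
# SpMV with B does unrolled 3x3 block multiplies; whether that wins
# depends on (block-multiply efficiency gain) / fill_ratio per machine,
# hence the need for search.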

Source: Accelerator Cavity Design Problem (Ko via Husbands)

100x100 Submatrix Along Diagonal

Post-RCM Reordering

Effect of Combined RCM+TSP Reordering
• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, ...

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later ...

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
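A plain numpy sketch of classical CG with the two communication points marked; it assumes nothing beyond the standard recurrences, and works with either a dense or a scipy.sparse matrix.

import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=500):
    """Classical CG. Per iteration: one SpMV (neighbor communication)
    and two dot products (global reductions) -- the communication the
    CA reorganization removes."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                    # SpMV
        alpha = rr / (p @ Ap)         # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                # dot product -> global reduction
        if np.sqrt(rr_new) <= tol * bnorm:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x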

Example: CA-Conjugate Gradient

The s-step bases are computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.
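A compact numpy sketch of CA-CG with the monomial basis, following the structure the slide describes (basis via the matrix powers kernel, one Gram matrix G per outer step, a coefficient-space inner loop with no communication); variable names and the convergence test are illustrative.

import numpy as np
import scipy.sparse as sp

def ca_cg(A, b, x0, s=4, outer=100, tol=1e-8):
    """CA-CG, monomial basis: one matrix powers kernel + one Gram matrix
    per outer step replace the s SpMVs and 2s dot products of s CG steps.
    Monomial bases become ill-conditioned as s grows (cf. the s = 16
    breakdown shown later); Newton/Chebyshev bases fix this."""
    n = b.size
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    bnorm = np.linalg.norm(b)
    for _ in range(outer):
        # matrix powers kernel: V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]
        V = np.zeros((n, 2 * s + 1))
        V[:, 0] = p
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        V[:, s + 1] = r
        for j in range(s - 1):
            V[:, s + 2 + j] = A @ V[:, s + 1 + j]
        # B shifts coefficients within each chunk: A (V c) = V (B c)
        Bm = np.zeros((2 * s + 1, 2 * s + 1))
        for j in range(s):
            Bm[j + 1, j] = 1.0
        for j in range(s - 1):
            Bm[s + 2 + j, s + 1 + j] = 1.0
        G = V.T @ V                       # one global reduction
        pc = np.zeros(2 * s + 1); pc[0] = 1.0      # p = V pc
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0  # r = V rc
        xc = np.zeros(2 * s + 1)                   # dx = V xc
        for _ in range(s):                # s steps, no communication
            Bp = Bm @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc += alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x += V @ xc
        r, p = V @ rc, V @ pc
        if np.linalg.norm(r) <= tol * bnorm:
            break
    return x

# usage: 2D Poisson, 5-point stencil, 30x30 grid (the model problem below)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(30, 30))
A = sp.kron(sp.eye(30), T) + sp.kron(T, sp.eye(30))
b = np.ones(900)
x = ca_cg(A, b, np.zeros(900), s=4)
assert np.linalg.norm(b - A @ x) <= 1e-6 * np.linalg.norm(b)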

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                          96

[Convergence plot, CA-CG (monomial basis) vs CG: slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; horizontal line marks machine precision.]
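The breakdown is easy to reproduce. This sketch (setup assumed to match the slide's model problem; the s = 16 threshold is what the plot reports) builds the 2D Poisson matrix and shows the condition number of the monomial Krylov basis [x, Ax, ..., A^s x] exploding with s:

```python
import numpy as np
import scipy.sparse as sp

n = 30                                   # 30x30 grid, as on the slide
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()  # 5-point stencil

rng = np.random.default_rng(0)
x = rng.random(n * n)
s_max = 16
K = np.empty((n * n, s_max + 1))
K[:, 0] = x / np.linalg.norm(x)
for j in range(1, s_max + 1):
    K[:, j] = A @ K[:, j - 1]            # monomial basis vector A^j x

for s in (4, 8, 16):                     # conditioning grows exponentially in s
    print(s, np.linalg.cond(K[:, :s + 1]))
```

The usual remedy in the CA-Krylov literature is a better-conditioned Newton or Chebyshev basis in place of the monomial one.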

                                                                                                          97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

Examples by storage class (see the sketch after the table):

Nonzero entries \ Indices  | Explicit (O(nnz))   | Implicit (o(nnz))
Explicit (O(nnz))          | CSR and variations  | Vision, climate, AMR, …
Implicit (o(nnz))          | Graph Laplacian     | Stencils
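A small illustration of two corners of this taxonomy (a sketch; the matrices here are illustrative stand-ins, not from the slides): CSR stores entries and indices explicitly in O(nnz) space, while a matrix-free stencil stores neither and applies A through index arithmetic alone.

```python
import numpy as np
import scipy.sparse as sp

# Explicit entries + explicit indices: CSR uses O(nnz) storage, o(n^2).
A = sp.random(1000, 1000, density=0.001, format='csr', random_state=0)
print(A.data.nbytes + A.indices.nbytes + A.indptr.nbytes,
      "bytes for", A.nnz, "nonzeros")

# Implicit entries + implicit indices: 1D Laplacian stencil, matrix-free.
def laplacian_1d(x):
    y = 2.0 * x
    y[:-1] -= x[1:]      # superdiagonal contribution
    y[1:]  -= x[:-1]     # subdiagonal contribution
    return y

print(laplacian_1d(np.ones(1000))[:3])   # boundary rows differ from interior
```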

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                          101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]
• Vector size: 1e6; data aligned to 16-byte boundaries
• For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value
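The effect is easy to imitate without MKL: summing the same data in a different order generally changes the rounded result. A sketch (a hypothetical setup, not the slide's experiment) that mimics a dot product split across 4 "threads":

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

d1 = np.dot(x, y)                        # one summation order
d2 = sum(np.dot(xc, yc)                  # another order: 4 chunked partial sums
         for xc, yc in zip(np.array_split(x, 4), np.array_split(y, 4)))
print(d1 - d2)   # typically nonzero: floating-point addition is not associative
```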

                                                                                                          103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
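As a sketch of the first approach only (the slide's prerounding technique of Nguyen and Demmel is more involved): a summation whose reduction tree depends on the vector length alone, not on the number of threads, so any schedule that respects the tree returns the same bits.

```python
import numpy as np

def tree_sum(v):
    """Pairwise summation over a fixed binary tree.

    The tree shape is a function of len(v) only, so the rounded result
    is independent of how the work is scheduled (satisfies goal 1,
    gives up some of goals 2-3 from the slide)."""
    n = len(v)
    if n == 1:
        return v[0]
    m = n // 2
    return tree_sum(v[:m]) + tree_sum(v[m:])

x = np.random.default_rng(1).standard_normal(10**5)
print(tree_sum(x))   # reproducible by construction for this layout
```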

                                                                                                          104

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                          Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                            Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q A Q^T = B, where B = B^T banded, of bandwidth M^{1/2}
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^{1/2} cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea
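To ground the pipeline's endgame, a small sketch (illustrative parameters; the Dense → Banded TSQR stage and the Banded → Tridiagonal bulge chasing are both omitted) that hands a symmetric banded B directly to a banded eigensolver and checks it against a dense solve:

```python
import numpy as np
from scipy.linalg import eig_banded, eigh

rng = np.random.default_rng(1)
n, b = 200, 5                            # order n, bandwidth b (illustrative)

# Build a random symmetric banded matrix B.
B = np.zeros((n, n))
for k in range(b + 1):
    d = rng.standard_normal(n - k)
    B += np.diag(d, k)
    if k:
        B += np.diag(d, -k)

# Pack the lower bands in LAPACK's banded storage: row k holds diagonal -k.
bands = np.zeros((b + 1, n))
for k in range(b + 1):
    bands[k, :n - k] = np.diag(B, -k)

w = eig_banded(bands, lower=True, eigvals_only=True)
print(np.allclose(w, eigh(B, eigvals_only=True)))   # True
```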

[Figure sequence: Successive Band Reduction (Bischof/Lang/Sun). Successive frames (labels 1-6) show orthogonal transformations Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T applied to the band, eliminating d diagonals, c columns at a time, and chasing the resulting bulges (of width d+c) down the matrix. Parameters: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

                                                                                                            Conventional vs CA - SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.


Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

                                                                                                            Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
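For intuition, a dense sketch of one divide step using the classical matrix-sign Newton iteration (an assumed stand-in for illustration, valid when no eigenvalue sits near the splitting line; the slide's algorithm instead avoids inverses and communication via implicit repeated squaring plus randomized RRQR):

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, iters=40):
    """Split the spectrum of real A at the vertical line Re(z) = shift."""
    n = A.shape[0]
    X = A - shift * np.eye(n)
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))   # Newton iteration -> sign(X)
    P = 0.5 * (X + np.eye(n))              # projector onto invariant subspace
    k = int(round(np.trace(P)))            # rank of a projector = its trace
    Q, _, _ = qr(P, pivoting=True)         # RRQR: leading k cols span range(P)
    return Q, k

# Usage: B = Q.T @ A @ Q is numerically block upper triangular with a
# tiny (2,1) block; recurse on B[:k, :k] and B[k:, k:].
```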

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

Operation          | Two Levels: Words / Messages                                   | Memory Hierarchy: Words / Messages
BLAS-3             | [FLPR'99][BDLST'13][MKL etc.]                                  | [FLPR'99][BDLST'13][MKL etc.]
Cholesky           | [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09]        | [G'97][AP'00][BDHS'09]
Sym. Indefinite    | [BBDDDPSTY'13]                                                 | [BBDDDPSTY'13]
LU                 | [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13]            | [G'97][T'97][BDLST'13] / [BDLST'13]
QR                 | [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
Rank-Revealing QR  | [BDD'11][DGGX'13]                                              |
Sym. Eig & SVD     | [BDD'11][BDK'13] / [BDD'11]                                    |
Non-Sym. Eig       | [BDD'11] / [BDD'11]                                            |

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n^2/P^{1/2}), messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]

Operation          | Words (BW)                                       | Messages (L)           | Saving factor
BLAS-3             | [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]  |                        | L: n/P^{1/2}
Cholesky           | [ScaLAPACK][T'99][SD'11]                         |                        | L: n/P^{1/2}
Sym. Indefinite    | [BBDDDPSTY'13][ScaLAPACK]                        | [BBDDDPSTY'13]         | L: n/P^{1/2}
LU                 | [ScaLAPACK][GDX'11][T'99][SD'11]                 | [GDX'11][T'99][SD'11]  | L: n/P^{1/2}
QR                 | [ScaLAPACK][DGHL'12][T'99]                       | [DGHL'12][T'99]        | L: n/P^{1/2}
Rank-Revealing QR  | [BDD'11][DGGX'13]                                |                        |
Sym. Eig & SVD     | [BDD'11][BDK'13][ScaLAPACK]                      | [BDD'11][BDK'13]       | L: n/P^{1/2}
Non-Sym. Eig       | [BDD'11]                                         | [BDD'11]               | BW: P^{1/2}, L: n

Attaining with extra memory (2.5D): M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax = b or Ax = λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data – optimal
  – Parallel implementation on p processors: conventional O(k log p) messages (k SpMV calls, dot products); new O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
A sketch of the matrix powers kernel idea follows this list.
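The O(1)-vs-O(k) serial claim above comes from the matrix powers kernel. Below is a minimal sketch of the idea for a tridiagonal A, with illustrative names (matrix_powers_local is not a library routine): each processor fetches a depth-k ghost region once, then computes its rows of x, A·x, …, A^k·x, paying some redundant flops near the boundary but doing no further data movement.

import numpy as np

def matrix_powers_local(A, x, rows, k):
    # rows = (lo, hi) owned by this "processor". For tridiagonal A,
    # entry i of A^j x depends only on x[i-j : i+j+1], so one fetch of
    # a depth-k ghost region suffices for all k products.
    n = A.shape[0]
    lo, hi = rows
    glo, ghi = max(0, lo - k), min(n, hi + k)
    xg = x[glo:ghi].copy()              # the single communication step
    Ag = A[glo:ghi, glo:ghi]            # owned rows/cols plus ghosts
    out = [xg[lo - glo:hi - glo].copy()]
    for _ in range(k):
        xg = Ag @ xg                    # ghost entries go stale one layer...
        out.append(xg[lo - glo:hi - glo].copy())   # ...owned rows stay exact
    return out

# Check one processor's segments against k full SpMVs.
n, k = 20, 3
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiagonal
x = np.random.rand(n)
segs = matrix_powers_local(A, x, (5, 10), k)
assert np.allclose(segs[3], (np.linalg.matrix_power(A, 3) @ x)[5:10])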


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
A baseline CSR kernel sketch follows.
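For reference, this is the kernel being tuned, in its unblocked CSR form (a sketch, not any library's code; spmv_csr is an illustrative name). The indirect, irregular access to x through col_idx is what makes performance hard to predict and blocking profitable.

import numpy as np

def spmv_csr(val, col_idx, row_ptr, x):
    # y = A x, with A stored in compressed sparse row format
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[j] * x[col_idx[j]]   # indirect load of x
    return y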


Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs


Speedups on Itanium 2: The Need for Search

[Figure: two register-profile panels, Reference and Best (4x2), with rates in Mflops.]

A toy block-size search follows.
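The "search" is literal: benchmark candidate r x c register blockings and keep the fastest. A toy version using SciPy's BSR format is sketched below (illustrative only; a production tuner such as OSKI combines off-line benchmarks with a run-time fill-ratio model instead of timing every variant).

import time
import numpy as np
import scipy.sparse as sp

def pick_block_size(A_csr, candidates=((1, 1), (2, 2), (4, 2), (4, 4), (8, 8))):
    x = np.random.rand(A_csr.shape[1])
    best, best_t = None, float("inf")
    for r, c in candidates:
        try:
            A_bsr = sp.bsr_matrix(A_csr, blocksize=(r, c))
        except ValueError:               # shape not divisible by (r, c)
            continue
        t0 = time.perf_counter()
        for _ in range(20):
            A_bsr @ x                    # the timed SpMV kernel
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = (r, c), t
    return best

A = sp.random(960, 960, density=0.01, format="csr")
print(pick_block_size(A))                # winner is matrix- and machine-dependent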


Register Profile: Itanium 2

[Figure: SpMV performance over all register blockings; annotated rates range from 190 Mflops to 1190 Mflops.]


Register Profiles: IBM and Intel IA-64

[Figure: four register-profile panels – Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33) – with annotated rates of 252 and 122 Mflops (Power3), 820 and 459 Mflops (Power4), 247 and 107 Mflops (Itanium 1), and 1.2 Gflops and 190 Mflops (Itanium 2).]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16614
• nnz = 1.1 M


Zoom in to top corner
• More complicated non-zero structure in general
• n = 16614
• nnz = 1.1 M


3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"


Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual mflop rate 1.5^2 = 2.25x higher
A short sketch of this effect follows.
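The same effect in a few lines of SciPy (a random matrix stands in for the real one, so the numbers will differ): converting CSR to 3x3 BSR stores explicit zeros inside blocks; the fill ratio measures the extra flops, and the win comes from unrolled, index-free access within each block.

import numpy as np
import scipy.sparse as sp

A = sp.random(900, 900, density=0.01, format="csr")
A_bsr = sp.bsr_matrix(A, blocksize=(3, 3))   # pads blocks with explicit zeros

fill_ratio = A_bsr.data.size / A.nnz         # stored values / true nonzeros
print(f"fill ratio = {fill_ratio:.2f}")
# If the blocked kernel gives an f-times overall speedup, its raw mflop
# rate is f * fill_ratio times higher (1.5 * 1.5 = 2.25 in the case above).
y = A_bsr @ np.random.rand(900)              # unrolled 3x3 block multiplies inside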


[Figure: spy plots. Source: Accelerator Cavity Design Problem (Ko via Husbands); 100x100 submatrix along diagonal; post-RCM reordering; effect of combined RCM+TSP reordering (before: green + red; after: green + blue).]

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …


Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV, A·x & A^T·y, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration. A sketch with the communication points marked follows.
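A textbook CG sketch with the per-iteration communication points marked in comments (serial NumPy stands in for the distributed kernels): one SpMV, a neighbor/halo exchange in parallel, plus two dot products, each a global reduction.

import numpy as np

def cg(A, b, tol=1e-8, maxit=200):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r                      # dot product -> global reduction
    for _ in range(maxit):
        Ap = A @ p                  # SpMV -> neighbor (halo) communication
        alpha = rr / (p @ Ap)       # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r              # dot product -> global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x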


Example: CA-Conjugate Gradient

The s SpMVs are replaced by one call to a CA matrix powers kernel, and a single global reduction computes the Gram matrix G; the local computations within the inner loop then require no communication. A stripped-down sketch follows.
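A monomial-basis CA-CG cycle written to expose the communication structure, not for performance; ca_cg is an illustrative name and this is a sketch of the reorganization, not the tuned implementation. Per outer cycle: one matrix powers call builds the 2s+1 basis vectors, one reduction builds G, and then s iterations' worth of dot products are done on short coefficient vectors with no communication.

import numpy as np

def ca_cg(A, b, s=4, maxcycles=100, tol=1e-10):
    n = len(b)
    x, r = np.zeros(n), b.copy()
    p = r.copy()
    for _ in range(maxcycles):
        # Matrix powers kernel (plain loop here): one communication round.
        V = np.zeros((n, 2 * s + 1))
        V[:, 0] = p
        for i in range(1, s + 1):
            V[:, i] = A @ V[:, i - 1]            # [p, Ap, ..., A^s p]
        V[:, s + 1] = r
        for i in range(s + 2, 2 * s + 1):
            V[:, i] = A @ V[:, i - 1]            # [r, Ar, ..., A^(s-1) r]
        G = V.T @ V                              # one global reduction
        # B = multiplication by A, acting on basis coordinates.
        B = np.zeros((2 * s + 1, 2 * s + 1))
        for i in list(range(s)) + list(range(s + 1, 2 * s)):
            B[i + 1, i] = 1.0
        d = np.zeros(2 * s + 1); d[0] = 1.0      # coords of p_j
        c = np.zeros(2 * s + 1); c[s + 1] = 1.0  # coords of r_j
        e = np.zeros(2 * s + 1)                  # coords of the x update
        for _ in range(s):                       # NO communication below
            w = B @ d                            # coords of A p_j
            alpha = (c @ G @ c) / (d @ G @ w)
            e += alpha * d
            c_new = c - alpha * w
            beta = (c_new @ G @ c_new) / (c @ G @ c)
            d = c_new + beta * d
            c = c_new
        x += V @ e; r = V @ c; p = V @ d         # back to long vectors
        if np.linalg.norm(r) < tol:              # once per cycle
            break
    return x

In exact arithmetic one cycle reproduces s steps of classical CG; in floating point the monomial basis limits s, which is the subject of the next slide.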

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


[Figure: convergence of CA-CG (monomial basis) vs. CG on a model problem – 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. Roundoff causes slower convergence and loss of accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

A few lines reproduce the ill-conditioning (sketch below).
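The breakdown is easy to reproduce: build the model problem and watch the condition number of the normalized monomial basis [x, Ax, …, A^s x] climb toward 1/ε.

import numpy as np

m = 30                                               # 30x30 grid
T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))    # 2D Poisson, 5-pt stencil
print(f"cond(A) = {np.linalg.cond(A):.0f}")          # roughly 400

x = np.random.rand(m * m)
V = [x / np.linalg.norm(x)]
for s in range(1, 17):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))
    print(s, np.linalg.cond(np.column_stack(V)))     # grows toward 1/eps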


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
Entries explicit (O(nnz)):    CSR and variations           Vision, climate, AMR, …
Entries implicit (o(nnz)):    Graph Laplacian              Stencils

A sketch of applying S + UDV^T without forming it follows.
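A sketch of working with the factored form; apply_A and apply_Ak are illustrative names. The point is that (S + UDV^T)x, and hence (S + UDV^T)^k x, never forms the dense sum: only the sparse S and the skinny factors move.

import numpy as np
import scipy.sparse as sp

def apply_A(S, U, D, Vt, x):
    # y = (S + U D V^T) x, keeping the low-rank part factored
    return S @ x + U @ (D @ (Vt @ x))

def apply_Ak(S, U, D, Vt, x, k):
    # y = (S + U D V^T)^k x by repeated application; the expansion into
    # S^k plus low-rank terms stays implicit
    for _ in range(k):
        x = apply_A(S, U, D, Vt, x)
    return x

n, rank = 1000, 5
S = sp.random(n, n, density=0.001, format="csr")
U = np.random.rand(n, rank)
Vt = np.random.rand(rank, n)
D = np.diag(np.random.rand(rank))
y = apply_Ak(S, U, D, Vt, np.random.rand(n), 3)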

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% – 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign is not reproducible. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector: dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value.]


Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
A toy demonstration follows this list.
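A toy illustration of the problem and of the fixed-tree approach; the prerounding technique itself is more involved and is not shown here.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**5) * 10.0 ** rng.integers(-10, 10, 10**5)

def chunked_sum(v, nthreads):
    # Mimics a typical parallel reduction: each "thread" sums its chunk,
    # then the partial sums are combined. The bits depend on nthreads.
    return float(sum(float(ch.sum()) for ch in np.array_split(v, nthreads)))

print({chunked_sum(x, t) for t in (1, 2, 3, 4)})   # often several distinct values

def fixed_tree_sum(v):
    # Approach 1: a pairwise reduction tree fixed by len(v) alone, so any
    # number of threads executing it gets identical bits; the price is
    # constraining the implementation (goals 2 and 3 above).
    v = [float(t) for t in v]
    while len(v) > 1:
        if len(v) % 2:
            v.append(0.0)               # pad odd levels
        v = [v[i] + v[i + 1] for i in range(0, len(v), 2)]
    return v[0]

print(fixed_tree_sum(x))   # same bits however the work is partitioned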


Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – The columns of Q·U are the eigenvectors; Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea (successive band reduction, below)
A sketch of the dense → banded phase follows.
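A minimal NumPy/SciPy sketch of the dense → banded phase, assuming symmetric A: each panel's below-band block is annihilated by one tall-skinny QR, applied from both sides (this is where TSQR enters, with b playing the role of M^(1/2)); scipy.linalg.eig_banded stands in for the remaining banded → tridiagonal → diagonal stages. An unoptimized illustration, not the library algorithm.

import numpy as np
from scipy.linalg import eig_banded

def full_to_banded(A, b):
    # Reduce symmetric A to bandwidth b by QR on b-column panels.
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n - b - 1, b):
        lo = j + b                          # first row below the band
        Q, _ = np.linalg.qr(A[lo:, j:j + b], mode="complete")
        A[lo:, :] = Q.T @ A[lo:, :]         # two-sided orthogonal update...
        A[:, lo:] = A[:, lo:] @ Q           # ...keeps A symmetric
    return A

n, b = 200, 8                               # think b ~ sqrt(fast memory size)
X = np.random.rand(n, n)
A = X + X.T
Ab = full_to_banded(A, b)

# Pack the b+1 lower diagonals for eig_banded, then check the spectrum.
bands = np.array([np.pad(np.diag(Ab, -i), (0, i)) for i in range(b + 1)])
w = eig_banded(bands, lower=True, eigvals_only=True)
assert np.allclose(np.sort(w), np.linalg.eigvalsh(A))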

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: animation frames of successive band reduction. Notation: b = bandwidth, c = #columns, d = #diagonals, with constraint c + d ≤ b. Each two-sided sweep Q1, Q1^T, Q2, Q2^T, Q3, Q3^T, … eliminates c columns of the band, creating bulges of width d + c that are chased down the matrix in steps 1, 2, 3, 4, 5, ….]

                                                                                                              Q4

                                                                                                              Q4T

                                                                                                              b+1

                                                                                                              b+1

                                                                                                              d+1

                                                                                                              d+1

                                                                                                              c

                                                                                                              c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                              Successive Band Reduction (BischofLangSun)

                                                                                                              1

                                                                                                              1

                                                                                                              2

                                                                                                              2

                                                                                                              3

                                                                                                              3

                                                                                                              4

                                                                                                              4

                                                                                                              5

                                                                                                              5

                                                                                                              Q5T

                                                                                                              Q1

                                                                                                              Q1T

                                                                                                              Q2

                                                                                                              Q2T

                                                                                                              Q3

                                                                                                              Q3T

                                                                                                              Q5

                                                                                                              Q4

                                                                                                              Q4T

                                                                                                              b+1

                                                                                                              b+1

                                                                                                              d+1

                                                                                                              d+1

                                                                                                              c

                                                                                                              c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                              Successive Band Reduction (BischofLangSun)

                                                                                                              1

                                                                                                              1

                                                                                                              2

                                                                                                              2

                                                                                                              3

                                                                                                              3

                                                                                                              4

                                                                                                              4

                                                                                                              5

                                                                                                              5

                                                                                                              6

                                                                                                              6

                                                                                                              Q5T

                                                                                                              Q1

                                                                                                              Q1T

                                                                                                              Q2

                                                                                                              Q2T

                                                                                                              Q3

                                                                                                              Q3T

                                                                                                              Q5

                                                                                                              Q4

                                                                                                              Q4T

                                                                                                              b+1

                                                                                                              b+1

                                                                                                              d+1

                                                                                                              d+1

                                                                                                              c

                                                                                                              c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              d+c

                                                                                                              b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                              Successive Band Reduction (BischofLangSun)

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

[Animations comparing the conventional and CA sweep patterns]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard (Hessenberg QR) algorithm to attain the communication lower bounds
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Then Q^T A Q is block upper triangular:

        Q^T A Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11 and A22
  – Depends on randomization:
    1. Randomized rank-revealing QR decomposition
    2. Randomized location to try splitting the spectrum

(A minimal sketch of the splitting step follows.)
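Below is a minimal NumPy/SciPy sketch of the splitting step, written with the classical Newton iteration for the matrix sign function for clarity. The algorithm referenced here ([BDD'11]) instead uses an inverse-free, QR-based iteration plus the randomization above, but the divide step has the same shape. Function name, iteration count, and the convergence assumption are illustrative, not from the original.

```python
import numpy as np
from scipy.linalg import qr

def spectral_divide(A, sigma=0.0, iters=50):
    """Split the spectrum of A about the line Re(z) = sigma."""
    n = A.shape[0]
    X = A - sigma * np.eye(n)
    for _ in range(iters):                # Newton iteration for sign(X);
        X = 0.5 * (X + np.linalg.inv(X))  # assumes no eigenvalue has Re(z) = sigma
    P = 0.5 * (np.eye(n) - X)             # spectral projector for Re(lambda) < sigma
    k = int(round(np.trace(P)))           # trace of a projector = subspace dimension
    Q, _, _ = qr(P, pivoting=True)        # rank-revealing QR: leading k columns span range(P)
    T = Q.T @ A @ Q                       # block upper triangular: norm(T[k:, :k]) is tiny
    return Q, T, k
```

One would then recurse on T[:k, :k] and T[k:, k:], choosing a new sigma for each half.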

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: two levels of memory (words, messages) and full memory hierarchy (words, messages).

• BLAS-3 – all four columns: [FLPR'99][BDLST'13][MKL etc.]
• Cholesky – two levels, words: [G'97][AP'00][LAPACK][BDHS'09]; messages: [G'97][AP'00][BDHS'09]; hierarchy, words & messages: [G'97][AP'00][BDHS'09]
• Sym. Indefinite – [BBDDDPSTY'13]
• LU – two levels, words: [G'97][T'97][GDX'11][BDLST'13]; messages: [GDX'11][BDLST'13]; hierarchy, words: [G'97][T'97][BDLST'13]; messages: [BDLST'13]
• QR – two levels, words: [EG'98][FW'03][DGHL'12][BDLST'13]; messages: [FW'03][DGHL'12][BDLST'13]; hierarchy, words: [EG'98][FW'03][BDLST'13]; messages: [FW'03][BDLST'13]
• Rank-Revealing QR – [BDD'11][DGGX'13]
• Sym. Eig & SVD – words: [BDD'11][BDK'13]; messages: [BDD'11]
• Non-Sym. Eig – [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: words (BW), messages (L), and the saving factor of the CA algorithm over the previous one.

• BLAS-3 – [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving: L: n/P^(1/2)
• Cholesky – [ScaLAPACK][T'99][SD'11]; saving: L: n/P^(1/2)
• Sym. Indefinite – words: [BBDDDPSTY'13][ScaLAPACK]; messages: [BBDDDPSTY'13]; saving: L: n/P^(1/2)
• LU – words: [ScaLAPACK][GDX'11][T'99][SD'11]; messages: [GDX'11][T'99][SD'11]; saving: L: n/P^(1/2)
• QR – words: [ScaLAPACK][DGHL'12][T'99]; messages: [DGHL'12][T'99]; saving: L: n/P^(1/2)
• Rank-Revealing QR – [BDD'11][DGGX'13]
• Sym. Eig & SVD – words: [BDD'11][BDK'13][ScaLAPACK]; messages: [BDD'11][BDK'13]; saving: L: n/P^(1/2)
• Non-Sym. Eig – [BDD'11]; saving: BW: P^(1/2), L: n

Attaining the bounds with extra memory: 2.5D algorithms, M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

(A sketch of the computation the "matrix powers kernel" reorganizes follows.)
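This sketch shows what the kernel computes; it is written naively here (k separate sweeps over A), with the communication-avoiding reorganization described in the docstring. Names, matrix, and sizes are illustrative.

```python
import numpy as np
from scipy.sparse import diags

def matrix_powers(A, x, k):
    """Return the Krylov basis [x, Ax, A^2 x, ..., A^k x] as columns.

    The communication-avoiding matrix powers kernel computes the same
    k+1 vectors while reading A from slow memory only once (sequential
    case), or with a single exchange of ghost-zone entries instead of
    k exchanges (parallel case), at the price of some redundant flops
    near the partition boundaries.
    """
    V = np.empty((x.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]   # naive version: one sweep over A per power
    return V

# Example: 1D Poisson (tridiagonal) matrix, k = 4
A = diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100)).tocsr()
V = matrix_powers(A, np.ones(100), 4)
```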

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Spy plot of the matrix]

Example: The Difficulty of Tuning (continued)

• Same matrix, zoomed in
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Register-profile heat map over block sizes: the reference CSR code runs at 190 Mflops; the best register blocking (4x2) reaches 1190 Mflops, and the winning block size is hard to predict analytically]

Register Profiles: IBM and Intel IA-64

[Four heat maps of Mflops vs. register block size; best case as a fraction of machine peak: Power3 – 17% (122–252 Mflops), Power4 – 16% (459–820 Mflops), Itanium 1 – 8% (107–247 Mflops), Itanium 2 – 33% (190 Mflops–1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

[Spy plot]

Zoom in to top corner: 3x3 blocks look natural, but…

• Example: 3x3 blocking – logical grid of 3x3 cells
• But it would lead to lots of "fill-in" of explicit zeros

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll the 3x3 block multiplies
  – "Fill ratio" = 1.5 (stored entries / true nonzeros)
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher, since the extra flops on explicit zeros are counted too

(A minimal block-CSR sketch follows.)
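A small sketch of the register-blocking idea using SciPy's BSR (block CSR) container: each stored block is a dense r-by-c tile, so explicit zeros are filled in wherever a tile is only partially populated, and the fill ratio can be read off directly. Matrix size and density are illustrative.

```python
import numpy as np
from scipy.sparse import bsr_matrix, random as sprandom

A = sprandom(3000, 3000, density=1e-3, format='csr')  # toy sparse matrix
x = np.random.rand(3000)

A_bsr = bsr_matrix(A, blocksize=(3, 3))   # 3x3 tiles, zero-filled
fill_ratio = A_bsr.nnz / A.nnz            # stored entries / true nonzeros
print(f"fill ratio = {fill_ratio:.2f}")

y = A_bsr @ x                             # same result as A @ x, despite the fill
assert np.allclose(y, A @ x)
```

Whether the denser, unrollable inner loop wins depends on the fill ratio and the machine, which is exactly why search is needed.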

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Spy plot]

100x100 Submatrix Along Diagonal

[Spy plot]

Post-RCM Reordering

[Spy plot]

Effect of Combined RCM+TSP Reordering

• Before: green + red; after: green + blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[CG pseudocode shown on slide] SpMVs and dot products require communication in each iteration; see the sketch below.
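For reference, a minimal textbook CG with the per-iteration communication points marked in comments. The model problem at the end is illustrative.

```python
import numpy as np
from scipy.sparse import diags

def cg(A, b, x, tol=1e-8, maxiter=1000):
    """Textbook CG; comments mark where a parallel run communicates."""
    r = b - A @ x               # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                  # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p              # SpMV: neighbor communication, every iteration
        alpha = rr / (p @ Ap)   # dot product: global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r          # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

A = diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100)).tocsr()  # SPD model problem
b = np.ones(100)
x = cg(A, b, np.zeros(100))
assert np.allclose(A @ x, b, atol=1e-6)
```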

Example: CA-Conjugate Gradient

[CA-CG pseudocode shown on slide] The s SpMVs of an outer iteration are computed via the CA matrix powers kernel, and a single global reduction computes the Gram matrix G; the local computations within the inner loop then require no communication. The sketch after this slide shows the identity that makes this work.
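A sketch of the enabling identity, using the monomial basis (which, as the next slide shows, is numerically fragile for larger s). Sizes and names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

# The trick that removes inner-loop communication: once per outer
# iteration, build the Krylov basis V (matrix powers kernel) and its
# Gram matrix G = V^T V (one global reduction).  Inside the s inner
# steps, every vector CG needs is represented by a short coefficient
# vector in this basis, so every dot product becomes a small local
# computation:  (V a)^T (V c) = a^T G c.

n, s = 1000, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)).tocsr()
p = np.random.rand(n)

V = np.empty((n, 2 * s + 1))
V[:, 0] = p
for j in range(2 * s):
    V[:, j + 1] = A @ V[:, j]   # matrix powers kernel: one communication phase
G = V.T @ V                      # Gram matrix: one global reduction

a = np.random.rand(2 * s + 1)    # coefficient vectors standing in for
c = np.random.rand(2 * s + 1)    # the inner-loop iterates
assert np.isclose((V @ a) @ (V @ c), a @ G @ c)   # dot product, locally
```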

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Plot: convergence of CG vs CA-CG (monomial basis), residual vs iteration down to machine precision. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG with the monomial basis shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

  Nonzero entries \ Indices | Explicit (O(nnz))   | Implicit (o(nnz))
  Explicit (O(nnz))         | CSR and variations  | Vision, climate, AMR, …
  Implicit (o(nnz))         | Graph Laplacian     | Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

(A sketch of applying such an A without densifying it follows.)
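A sketch of why such an A is still cheap to apply: keep S, U, D, V separately and never form the dense matrix. Sizes and the rank r are illustrative.

```python
import numpy as np
import scipy.sparse as sp

n, r = 2000, 5
S = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)).tocsr()  # sparse part
U, V = np.random.rand(n, r), np.random.rand(n, r)            # low-rank factors
D = np.diag(np.random.rand(r))                               # small & square

def apply_A(x):
    # A @ x = S @ x + U (D (V^T x)): O(nnz(S) + n r) work, no dense n x n matrix
    return S @ x + U @ (D @ (V.T @ x))

x = np.random.rand(n)
y = apply_A(apply_A(x))          # A^2 x, still without forming A
A_dense = S.toarray() + U @ D @ V.T
assert np.allclose(y, A_dense @ (A_dense @ x))
```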

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

[Figure: Intel MKL non-reproducibility. Left panel, absolute error for random vectors: same magnitude, opposite signs. Right panel, relative error for orthogonal vectors: even the sign is not reproducible. Setup: vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value]

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

(A toy demonstration follows.)
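A toy illustration of the problem and of the "exact answer" approach, using Python's math.fsum (correctly rounded, hence order-independent) as a stand-in for approach 2; the prerounding technique achieves reproducibility at much lower cost. Data generation is illustrative.

```python
import math, random

# Summands spanning many magnitudes, where rounding effects are visible.
x = [random.uniform(-1, 1) * 10 ** random.randint(0, 15) for _ in range(10 ** 5)]

# Floating-point addition is not associative: different summation orders
# (different thread counts, different reduction trees) give different answers.
print(sum(x) == sum(reversed(x)))                 # usually False

# A correctly rounded sum is independent of the summation order.
print(math.fsum(x) == math.fsum(reversed(x)))     # True
```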

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                              • Implementing Communication-Avoiding Algorithms
                                                                                                              • Why avoid communication
                                                                                                              • Goals
                                                                                                              • Outline
                                                                                                              • Outline (2)
                                                                                                              • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                              • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                              • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                              • Limits to parallel scaling (12)
                                                                                                              • Limits to parallel scaling (22)
                                                                                                              • Can we attain these lower bounds
                                                                                                              • Outline (3)
                                                                                                              • 25D Matrix Multiplication
                                                                                                              • 25D Matrix Multiplication (2)
                                                                                                              • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                              • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                              • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                              • Handling Heterogeneity
                                                                                                              • Application to Tensor Contractions
                                                                                                              • C(ijk) = Σm A(ijm)B(mk)
                                                                                                              • Application to Tensor Contractions (2)
                                                                                                              • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                              • vs
                                                                                                              • Slide 26
                                                                                                              • Strassen-like beyond matmul
                                                                                                              • Cache and Network Oblivious Algorithms
                                                                                                              • CARMA Performance Distributed Memory
                                                                                                              • CARMA Performance Distributed Memory (2)
                                                                                                              • CARMA Performance Shared Memory
                                                                                                              • CARMA Performance Shared Memory (2)
                                                                                                              • Why is CARMA Faster in Shared Memory
                                                                                                              • Outline (4)
                                                                                                              • One-sided Factorizations (LU QR) so far
                                                                                                              • TSQR An Architecture-Dependent Algorithm
                                                                                                              • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                              • Minimizing Communication in TSLU
                                                                                                              • Making TSLU Numerically Stable
                                                                                                              • Stability of LU using TSLU CALU
                                                                                                              • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                              • Fixing TSLU
                                                                                                              • 2D CALU with Tournament Pivoting
                                                                                                              • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                              • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                              • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                              • 25D vs 2D LU With and Without Pivoting
                                                                                                              • Other CA algorithms for Ax=b least squares(13)
                                                                                                              • Other CA algorithms for Ax=b least squares (23)
                                                                                                              • Other CA algorithms for Ax=b least squares (33)
                                                                                                              • Outline (5)
                                                                                                              • What about sparse matrices (13)
                                                                                                              • Performance of 25D APSP using Kleene
                                                                                                              • What about sparse matrices (23)
                                                                                                              • What about sparse matrices (33)
                                                                                                              • Outline (6)
                                                                                                              • Symmetric Eigenproblem and SVD
                                                                                                              • Slide 58
                                                                                                              • Slide 59
                                                                                                              • Slide 60
                                                                                                              • Slide 61
                                                                                                              • Slide 62
                                                                                                              • Slide 63
                                                                                                              • Slide 64
                                                                                                              • Slide 65
                                                                                                              • Slide 66
                                                                                                              • Slide 67
                                                                                                              • Slide 68
                                                                                                              • Conventional vs CA - SBR
                                                                                                              • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                              • Nonsymmetric Eigenproblem
                                                                                                              • Attaining the Lower bounds Sequential
                                                                                                              • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                              • Outline (7)
                                                                                                              • Avoiding Communication in Iterative Linear Algebra
                                                                                                              • Outline (8)
                                                                                                              • Example The Difficulty of Tuning SpMV
                                                                                                              • Example The Difficulty of Tuning
                                                                                                              • Speedups on Itanium 2 The Need for Search
                                                                                                              • Register Profile Itanium 2
                                                                                                              • Register Profiles IBM and Intel IA-64
                                                                                                              • Another example of tuning challenges for SpMV
                                                                                                              • Zoom in to top corner
                                                                                                              • 3x3 blocks look natural buthellip
                                                                                                              • Extra Work Can Improve Efficiency
                                                                                                              • Slide 86
                                                                                                              • Slide 87
                                                                                                              • Slide 88
                                                                                                              • Slide 89
                                                                                                              • Summary of Other Performance Optimizations
                                                                                                              • Optimized Sparse Kernel Interface - OSKI
                                                                                                              • Outline (9)
                                                                                                              • Example Classical Conjugate Gradient (CG)
                                                                                                              • Example CA-Conjugate Gradient
                                                                                                              • Outline (10)
                                                                                                              • Slide 96
                                                                                                              • Slide 97
                                                                                                              • Outline (11)
                                                                                                              • What is a ldquosparse matrixrdquo
                                                                                                              • Outline (12)
                                                                                                              • Reproducible Floating Point Computation
                                                                                                              • Intel MKL non-reproducibility
                                                                                                              • GoalsApproaches for Reproducibility
                                                                                                              • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                              • Collaborators and Supporters
                                                                                                              • Summary

Successive Band Reduction (Bischof/Lang/Sun)

[Animation sequence omitted: a symmetric band matrix of bandwidth b+1 is reduced sweep by sweep. Each sweep Qi, applied on both sides as QiT · A · Qi, eliminates a parallelogram of c columns and d diagonals, creating a bulge of size d+c that later steps 1, 2, 3, 4, 5, 6, … chase down the band.]

Legend: b = bandwidth, c = #columns, d = #diagonals. Constraint: c + d ≤ b

Conventional vs CA - SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

[Side-by-side animations of the two approaches omitted]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads

• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads

• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads

• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads

• Neither MKL nor ACML benefits from multithreading in DSBTRD
 – Best sequential speedup vs MKL: 1.9x
 – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
 – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
 – Q^T A Q will be block upper triangular:

     Q^T A Q = [ A11  A12 ]
               [  ε   A22 ]

 – Apply recursively to A11, A22
 – Depends on randomization:
  1. Randomized Rank-Revealing QR decomposition
  2. Randomized location to try splitting spectrum

(A sketch of one divide step follows below.)
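For intuition, here is a hedged NumPy sketch of one divide step via the classical matrix sign function plus a rank-revealing QR. The slides' algorithm uses a randomized RRQR and a randomized choice of where to split the spectrum; this toy replaces both with a deterministic RRQR and a fixed split at Re(λ) = σ, so it is an illustration of the structure, not the talk's exact method.

```python
import numpy as np
from scipy.linalg import qr

def divide_step(A, sigma=0.0, iters=40):
    """One spectral divide step: split eigenvalues of A at Re(lambda) = sigma."""
    n = A.shape[0]
    X = A - sigma * np.eye(n)
    for _ in range(iters):                 # Newton iteration for sign(A - sigma*I);
        X = 0.5 * (X + np.linalg.inv(X))   # needs no eigenvalues on the split line
    P = 0.5 * (np.eye(n) + X)              # spectral projector onto Re(lambda) > sigma
    Q, _, _ = qr(P, pivoting=True)         # RRQR: leading columns span range(P)
    k = int(round(np.trace(P)))            # dimension of the invariant subspace
    T = Q.T @ A @ Q                        # block upper triangular up to roundoff
    return Q, T, k                         # recurse on T[:k, :k] and T[k:, k:]

# demo: eigenvalues near {3, -1, 2, -4} split around sigma = 0
rng = np.random.default_rng(0)
A = np.diag([3.0, -1.0, 2.0, -4.0]) + 0.1 * rng.standard_normal((4, 4))
Q, T, k = divide_step(A)
print(k, "eigenvalue(s) right of the split; ||T[k:, :k]|| =", np.linalg.norm(T[k:, :k]))
```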

Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

(columns: #words and #messages, for two levels of memory and for a full memory hierarchy)

BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.] | attained for both #words and #messages, at both levels
Cholesky: #words (two levels) [G'97] [AP'00] [LAPACK] [BDHS'09] | #messages and hierarchy [G'97] [AP'00] [BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] for both #words and #messages
LU: #words (two levels) [G'97] [T'97] [GDX'11] [BDLST'13] | #messages (two levels) [GDX'11] [BDLST'13] | hierarchy #words [G'97] [T'97] [BDLST'13] | hierarchy #messages [BDLST'13]
QR: #words (two levels) [EG'98] [FW'03] [DGHL'12] [BDLST'13] | #messages (two levels) [FW'03] [DGHL'12] [BDLST'13] | hierarchy #words [EG'98] [FW'03] [BDLST'13] | hierarchy #messages [FW'03] [BDLST'13]
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD: #words [BDD'11] [BDK'13] | #messages [BDD'11]
Non-Sym. Eig: [BDD'11] for both

Attaining the Lower bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

(columns: #words (BW), #messages (L), saving factor of the CA algorithm)

BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11] | L: n/P^(1/2)
Cholesky: [ScaLAPACK] [T'99] [SD'11] | L: n/P^(1/2)
Sym. Indefinite: #words [BBDDDPSTY'13] [ScaLAPACK] | #messages [BBDDDPSTY'13] | L: n/P^(1/2)
LU: #words [ScaLAPACK] [GDX'11] [T'99] [SD'11] | #messages [GDX'11] [T'99] [SD'11] | L: n/P^(1/2)
QR: #words [ScaLAPACK] [DGHL'12] [T'99] | #messages [DGHL'12] [T'99] | L: n/P^(1/2)
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD: #words [BDD'11] [BDK'13] [ScaLAPACK] | #messages [BDD'11] [BDK'13] | L: n/P^(1/2)
Non-Sym. Eig: #words [BDD'11] | #messages [BDD'11] | BW: P^(1/2), L: n

Attaining with extra memory: 2.5D, M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal (see the matrix-powers sketch below)
 – Parallel implementation on p processors
  • Conventional: O(k log p) messages (k SpMV calls, dot products)
  • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                                75
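The serial claim above, O(1) data movement for k SpMVs, comes from the matrix powers kernel. Below is a minimal sketch under simplifying assumptions (A is the 1D Poisson tridiagonal with Dirichlet boundaries; the block size is hypothetical): each block of x is read once together with k ghost entries per side, and Ax, …, A^k x are all computed for that block before moving on, at the price of redundantly recomputing ghost values.

```python
import numpy as np

def matrix_powers_1d(x, k, block=1024):
    """V[j] = A^j x for A = tridiag(-1, 2, -1), reading each block of x once."""
    n = len(x)
    V = np.zeros((k + 1, n))
    V[0] = x
    for start in range(0, n, block):
        end = min(start + block, n)
        lo, hi = max(0, start - k), min(n, end + k)   # k ghost entries per side
        w = x[lo:hi].copy()                           # the single slow-memory read
        for j in range(1, k + 1):
            y = 2.0 * w
            y[1:] -= w[:-1]                           # zero-padding is exact only at
            y[:-1] -= w[1:]                           # true domain boundaries
            w = y
            # after j <= k steps, entries for global indices [start, end) are
            # still valid; window edges hold redundant, soon-discarded values
            V[j, start:end] = w[start - lo:end - lo]
    return V

# check against explicit dense powers
n, k = 10, 3
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.arange(1.0, n + 1)
V, ref = matrix_powers_1d(x, k, block=4), x.copy()
for j in range(1, k + 1):
    ref = A @ ref
    assert np.allclose(V[j], ref)
```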

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                                78

Speedups on Itanium 2: The Need for Search

[Register-profile heat maps omitted: the reference (unblocked) code vs. the best block size found by search, 4x2; color scales in Mflops]

                                                                                                                79

Register Profile: Itanium 2

[Heat map omitted: performance across register block sizes ranges from 190 Mflops to 1190 Mflops]

                                                                                                                80

Register Profiles: IBM and Intel IA-64

[Four register-profile heat maps omitted (best block size, as % of machine peak):
 Power3 – 17%: 122 to 252 Mflops; Power4 – 16%: 459 to 820 Mflops;
 Itanium 1 – 8%: 107 to 247 Mflops; Itanium 2 – 33%: 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                                82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                                83

3x3 blocks look natural, but…

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                                84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
 – Actual mflop rate 1.5² = 2.25x higher
(a BCSR sketch follows below)

                                                                                                                85
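A sketch of the register-blocking idea just described, assuming a hypothetical 3x3 BCSR layout: zeros inside a block are stored explicitly, so one column index is loaded per 9 values and the 3x3 block multiply can be fully unrolled.

```python
import numpy as np

def bcsr_spmv_3x3(brow_ptr, bcol_idx, blocks, x):
    """y = A @ x, A stored as 3x3 blocks (BCSR).
    brow_ptr : block-row pointers, len = n_brows + 1
    bcol_idx : block-column index of each stored block
    blocks   : array of shape (nblocks, 3, 3), explicit zero fill included
    """
    n_brows = len(brow_ptr) - 1
    y = np.zeros(3 * n_brows)
    for bi in range(n_brows):
        y0 = y1 = y2 = 0.0
        for t in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j = 3 * bcol_idx[t]
            b = blocks[t]
            x0, x1, x2 = x[j], x[j + 1], x[j + 2]
            # unrolled 3x3 block multiply: one index load per 9 values
            y0 += b[0, 0]*x0 + b[0, 1]*x1 + b[0, 2]*x2
            y1 += b[1, 0]*x0 + b[1, 1]*x1 + b[1, 2]*x2
            y2 += b[2, 0]*x0 + b[2, 1]*x1 + b[2, 2]*x2
        y[3*bi], y[3*bi + 1], y[3*bi + 2] = y0, y1, y2
    return y

# check: 6x6 matrix with two 3x3 blocks on the diagonal
rng = np.random.default_rng(0)
blocks = rng.standard_normal((2, 3, 3))
brow_ptr, bcol_idx = np.array([0, 1, 2]), np.array([0, 1])
A = np.zeros((6, 6)); A[:3, :3] = blocks[0]; A[3:, 3:] = blocks[1]
x = rng.standard_normal(6)
assert np.allclose(bcsr_spmv_3x3(brow_ptr, bcol_idx, blocks, x), A @ x)
```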

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                                                                86

                                                                                                                100x100 Submatrix Along Diagonal

87

                                                                                                                Post-RCM Reordering

                                                                                                                88

                                                                                                                Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR (see the SpMM sketch below)
 – And combinations…

• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
 – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

                                                                                                                90
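To see why multiple vectors help (the 7x entry above), here is a sketch: a CSR SpMM reads each nonzero and its column index once and applies it to all k right-hand sides, amortizing matrix and index traffic k ways versus k separate SpMVs.

```python
import numpy as np

def csr_spmm(ptr, ind, val, X):
    """Y = A @ X for CSR A (ptr, ind, val) and dense X of shape (n, k)."""
    m = len(ptr) - 1
    Y = np.zeros((m, X.shape[1]))
    for i in range(m):
        for t in range(ptr[i], ptr[i + 1]):
            Y[i] += val[t] * X[ind[t]]   # one nonzero load, k multiply-adds
    return Y

# tiny usage check against a dense multiply
ptr = np.array([0, 2, 3])
ind = np.array([0, 1, 1])
val = np.array([1.0, 2.0, 3.0])
X = np.arange(6.0).reshape(2, 3)
A = np.array([[1.0, 2.0], [0.0, 3.0]])
assert np.allclose(csr_spmm(ptr, ind, val, X), A @ X)
```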

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski

• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

                                                                                                                91
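OSKI itself is a C library; as a rough, hypothetical stand-in for the data-structure transformation its tuner automates, SciPy can store the same matrix in register-blocked (BSR) form and run the same SpMV on either layout:

```python
import numpy as np
from scipy.sparse import random as sprandom

# Convert a CSR matrix to 3x3 register-blocked (BSR) storage -- the kind
# of layout choice OSKI's off-line/run-time tuning makes per matrix/machine.
A_csr = sprandom(600, 600, density=0.01, format="csr", random_state=0)
A_bsr = A_csr.tobsr(blocksize=(3, 3))     # explicit zero fill added as needed
x = np.ones(600)
assert np.allclose(A_csr @ x, A_bsr @ x)  # same SpMV result, different layout
```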

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                93

Example: Classical Conjugate Gradient (CG)

[Pseudocode figure omitted. Callout: SpMVs and dot products require communication in each iteration]

94
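For reference, here is a minimal textbook CG (a standard formulation, not necessarily the slide's exact pseudocode) with the per-iteration communication points marked: one SpMV and two dot products, i.e., one neighbor exchange and two global reductions per iteration in a parallel setting.

```python
import numpy as np

def cg(A, b, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b)
    r = b.copy()                   # r = b - A x  (x starts at 0)
    p = r.copy()
    rs = r @ r                     # dot product -> global reduction
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV -> neighbor communication
        alpha = rs / (p @ Ap)      # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product -> global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# demo on a small SPD system (1D Poisson)
n = 50
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))
```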

                                                                                                                Example CA-Conjugate Gradient

                                                                                                                Local computations within inner loop require

                                                                                                                no communication
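A sketch of the s-step reorganization with the monomial basis, following the structure above (names and s = 4 are our choices; a practical code would use a better-conditioned basis and an optimized, blocked powers kernel):

```python
import numpy as np

def ca_cg(A, b, x0, s=4, outer=25):
    """s-step (communication-avoiding) CG sketch, monomial basis."""
    n = len(b)
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    m = 2 * s + 1
    # Shift matrix B: A*(V @ c) = V @ (B @ c) for the bases built below.
    B = np.zeros((m, m))
    for i in range(s):
        B[i + 1, i] = 1.0              # A * (A^i p) = A^{i+1} p
    for i in range(s - 1):
        B[s + 2 + i, s + 1 + i] = 1.0  # A * (A^i r) = A^{i+1} r
    for _ in range(outer):
        # Matrix powers kernel: V = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]
        V = np.empty((n, m))
        V[:, 0] = p
        for i in range(s):
            V[:, i + 1] = A @ V[:, i]
        V[:, s + 1] = r
        for i in range(s - 1):
            V[:, s + 2 + i] = A @ V[:, s + 1 + i]
        G = V.T @ V                    # the ONE global reduction
        pc = np.zeros(m); pc[0] = 1.0  # p, r, x - x0 in coefficient space
        rc = np.zeros(m); rc[s + 1] = 1.0
        xc = np.zeros(m)
        for _ in range(s):             # communication-free inner loop
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc += alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x += V @ xc
        r = V @ rc
        p = V @ pc
    return x
```

Per s steps this communicates O(1) times (the powers kernel plus the reduction for G) instead of Θ(s) times; in exact arithmetic it produces the same iterates as classical CG.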

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


[Convergence plot: CA-CG with the monomial basis vs. classical CG, residual norm per iteration, with a line marking machine precision. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
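The breakdown is easy to reproduce: the columns of the monomial basis all converge toward the dominant eigenvector, so the basis loses numerical rank as s grows. A quick NumPy check on the model problem (column normalization is our choice):

```python
import numpy as np

# Model problem from the plot: 2D Poisson, 5-point stencil, 30x30 grid.
def poisson2d(k):
    T = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    return np.kron(np.eye(k), T) + np.kron(T, np.eye(k))

A = poisson2d(30)
v = np.random.default_rng(0).standard_normal(A.shape[0])

for s in (4, 8, 16):
    # Monomial Krylov basis [v, Av, ..., A^s v], columns normalized.
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for i in range(s):
        V[:, i + 1] = A @ V[:, i]
        V[:, i + 1] /= np.linalg.norm(V[:, i + 1])
    print(f"s = {s:2d}: cond(V) = {np.linalg.cond(V):.2e}")
```

By s = 16 the condition number approaches 1/ε, i.e., the basis is numerically rank deficient, which is why practical CA-Krylov methods use better-conditioned (e.g., Newton or Chebyshev) bases.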


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                                   Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit (O(nnz)): CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit (o(nnz)): Graph Laplacian             Stencils
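A matrix that is "sparse" in this operational sense: A = S + U·D·Vᵀ below is dense as an array, yet it can be stored and applied with o(n^2) data. A small SciPy sketch (sizes are arbitrary):

```python
import numpy as np
from scipy.sparse import random as sprandom

# A = S + U D V^T applied to a vector without ever forming A densely.
n, k = 1000, 5
S = sprandom(n, n, density=0.01, format="csr", random_state=0)
rng = np.random.default_rng(0)
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))

def apply_A(x):
    # O(nnz(S) + n*k) work and storage, o(n^2) overall.
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
y = apply_A(x)
```

The same structure extends to powers: (S + U·D·Vᵀ)^2 = S^2 + (low rank), which is what the Aᵏ expansion above exploits.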

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Two plots: absolute error for random vectors (results of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]
Experiment: vector size 10^6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
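The root cause is easy to demonstrate without MKL: floating-point addition is not associative, so different reduction trees (e.g., different thread counts) give different sums. A NumPy toy where a 4-way split stands in for 4 threads:

```python
import numpy as np

rng = np.random.default_rng(0)
# Summands spanning many magnitudes make the effect easy to see.
x = rng.standard_normal(10**6) * 10.0 ** rng.integers(-9, 10, 10**6)

s1 = np.sum(x)                                      # one summation order
s2 = sum(np.sum(c) for c in np.array_split(x, 4))   # "4 threads"
print(s1 - s2)   # typically nonzero: FP addition is not associative
```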


Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.); see the sketch below
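A toy, single-bin version of the pre-rounding idea (the real Nguyen/Demmel algorithm, as in ReproBLAS, uses several bins to retain more accuracy): round every summand to a common power-of-two quantum chosen so that every subsequent addition is exact, making the sum independent of ordering.

```python
import math
import numpy as np

def reproducible_sum(x):
    """Toy single-bin pre-rounding: order-independent FP summation."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    M = float(np.max(np.abs(x)))
    if M == 0.0:
        return 0.0
    # |x_i/q| <= 2**(52 - n.bit_length()), so any partial sum of n terms
    # stays below 2**52 * q and is exactly representable in binary64.
    q = 2.0 ** (math.frexp(M)[1] - 52 + n.bit_length())
    xr = np.round(x / q) * q       # pre-round (the only rounding error)
    return float(np.sum(xr))       # every addition is now exact

x = np.random.default_rng(1).standard_normal(10**6)
assert reproducible_sum(x) == reproducible_sum(x[::-1])  # order-independent
```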


Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Successive Band Reduction (Bischof/Lang/Sun)

[Animation slides: a band matrix of bandwidth b+1 is reduced by a sequence of orthogonal transforms Q1, Q1ᵀ, …, Q5, Q5ᵀ. Each sweep annihilates c columns while keeping d diagonals, creating a (d+c)-by-(d+c) bulge that is chased down the band (bulges numbered 1–6 in the animation). Parameters: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA - SBR

• Conventional: touch all data 4 times
• Communication-avoiding: touch all data once

[Animations of the two bulge-chasing schedules]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
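A dense NumPy sketch of one divide step, using a Newton iteration for the matrix sign function and LAPACK's pivoted QR as the rank-revealing QR (the slide's randomized, communication-avoiding algorithm replaces both; this is only the underlying idea):

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, iters=50):
    """One step of spectral divide-and-conquer (dense sketch).

    Builds the spectral projector onto eigenvalues with Re(lambda) >
    shift via the matrix sign function, then uses a rank-revealing QR
    so the leading columns of Q span that invariant subspace.  Assumes
    no eigenvalue lies on the splitting line Re(lambda) = shift.
    """
    n = A.shape[0]
    S = A - shift * np.eye(n)
    for _ in range(iters):                # Newton: S <- (S + S^{-1}) / 2
        S = 0.5 * (S + np.linalg.inv(S))
    P = 0.5 * (S + np.eye(n))             # spectral projector, rank k
    k = int(round(np.trace(P)))
    Q, _, _ = qr(P, pivoting=True)        # RRQR: leading k cols span range(P)
    T = Q.T @ A @ Q                       # ~ block upper triangular
    return Q, T, k

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
Q, T, k = split_spectrum(A)
print(k, np.linalg.norm(T[k:, :k]))       # (2,1) block should be ~ 0
```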

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of memory (#Words, #Messages) and Memory Hierarchy (#Words, #Messages)

• BLAS-3: two levels [FLPR'99][BDLST'13][MKL etc.]; hierarchy [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: two levels [G'97][AP'00][LAPACK][BDHS'09]; hierarchy [G'97][AP'00][BDHS'09] (words and messages)
• Sym. Indefinite: [BBDDDPSTY'13] in all four columns
• LU: two levels, words [G'97][T'97][GDX'11][BDLST'13], messages [GDX'11][BDLST'13]; hierarchy, words [G'97][T'97][BDLST'13], messages [BDLST'13]
• QR: two levels, words [EG'98][FW'03][DGHL'12][BDLST'13], messages [FW'03][DGHL'12][BDLST'13]; hierarchy, words [EG'98][FW'03][BDLST'13], messages [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: words [BDD'11][BDK'13], messages [BDD'11]
• Non-sym. Eig: [BDD'11] (words and messages)

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: words (BW) [AGZ'94][MT'99][ScaLAPACK], messages (L) [C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym. Indefinite: words [BBDDDPSTY'13][ScaLAPACK], messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: words [ScaLAPACK][GDX'11][T'99][SD'11], messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: words [ScaLAPACK][DGHL'12][T'99], messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: words [BDD'11][BDK'13][ScaLAPACK], messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym. Eig: words and messages [BDD'11]; saving factors BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = O(c·n²/P)


Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
      • Conventional: O(k log p) messages (k SpMV calls, dot products)
      • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation (see the sketch below)
  – Challenges: poor partitioning, preconditioning, numerical stability
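To illustrate the serial O(1)-data-movement idea, here is a minimal sketch of a matrix powers kernel for a tridiagonal A (a 1D 3-point stencil). The function name and blocking scheme are illustrative; a real implementation handles general sparsity graphs and runs in C/MPI. Each block fetches its piece of x plus k "ghost" entries per side once, then computes all k SpMVs locally, trading redundant flops for O(1) reads per block instead of O(k).

```python
import numpy as np

def matrix_powers_1d(a_diag, a_off, x, k, nblocks):
    """Compute V = [x, Ax, ..., A^k x] for tridiagonal A, reading each
    block of A and x only once (plus k ghost entries per side)."""
    n = x.size
    V = np.empty((k + 1, n))
    V[0] = x
    bounds = np.linspace(0, n, nblocks + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        glo, ghi = max(lo - k, 0), min(hi + k, n)  # ghost region of width k
        v = x[glo:ghi].copy()                      # the one fetch per block
        for j in range(1, k + 1):
            w = a_diag[glo:ghi] * v
            w[:-1] += a_off[glo:ghi - 1] * v[1:]   # superdiagonal A[i, i+1]
            w[1:]  += a_off[glo:ghi - 1] * v[:-1]  # subdiagonal   A[i, i-1]
            V[j, lo:hi] = w[lo - glo:hi - glo]     # keep only the owned part;
            v = w                                  # edges go stale, but only
    return V                                       # within the ghost margin

# Verify against k explicit SpMVs (redundant flops, but O(1) data reuse):
n, k = 1000, 8
rng = np.random.default_rng(1)
d, e, x = rng.random(n), rng.random(n - 1), rng.random(n)
V = matrix_powers_1d(d, e, x, k, nblocks=10)
A = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
ref = x.copy()
for j in range(1, k + 1):
    ref = A @ ref
    assert np.allclose(V[j], ref)
```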


Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
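Here is a minimal Python sketch of register-blocked (BCSR) SpMV, the optimization these slides motivate. scipy's bsr_matrix supplies the data structure; the kernel shown is illustrative (production kernels are unrolled C for a fixed r×c).

```python
import numpy as np
from scipy.sparse import random as sprand, bsr_matrix

def bcsr_spmv(R, C, brow_ptr, bcol_idx, bvals, x):
    """y = A @ x with A in Block CSR (BSR) format, R x C dense blocks.
    One column index is stored per R*C block instead of one per nonzero,
    so index memory traffic drops by ~R*C (64x for raefsky's 8x8 blocks),
    and the inner product is a tiny dense matmul a compiler can unroll."""
    nbrows = len(brow_ptr) - 1
    y = np.zeros(nbrows * R)
    for bi in range(nbrows):                    # block row bi
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            c0 = bcol_idx[k] * C                # block's leftmost column
            y[bi * R:(bi + 1) * R] += bvals[k] @ x[c0:c0 + C]
    return y

A = bsr_matrix(sprand(64, 64, density=0.1, format='csr'), blocksize=(8, 8))
x = np.random.rand(64)
assert np.allclose(bcsr_spmv(8, 8, A.indptr, A.indices, A.data, x), A @ x)
```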

Speedups on Itanium 2: The Need for Search
[Figure: Mflops for every r x c register block size; the reference (unblocked) kernel vs the best block size, an unintuitive 4x2]

Register Profile: Itanium 2
[Figure: register-blocking profile; performance ranges from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64
[Figure: four register-blocking profiles, annotated with the best fraction of machine peak —
 Power3: 17% (122–252 Mflops); Power4: 16% (459–820 Mflops);
 Itanium 1: 8% (107–247 Mflops); Itanium 2: 33% (190 Mflops – 1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual mflop rate 1.5² = 2.25x higher
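A sketch of the Sparsity/OSKI-style selection heuristic behind these numbers: measure the fill ratio of each candidate block size on the actual matrix, then weigh it against a per-machine dense benchmark. The dense_mflops table and function names here are assumptions for illustration; for the matrix on the slide, fill_ratio at 3x3 would come out to 1.5.

```python
import numpy as np
from scipy.sparse import csr_matrix

def fill_ratio(A, r, c):
    """Fill ratio of r x c register blocking: stored entries after padding
    every touched r x c block with explicit zeros, over true nonzeros."""
    A = csr_matrix(A)
    rows, cols = A.nonzero()
    nblocks = len(set(zip(rows // r, cols // c)))  # distinct blocks touched
    return nblocks * r * c / A.nnz

def pick_block_size(A, dense_mflops,
                    sizes=((1, 1), (2, 2), (3, 3), (4, 2), (8, 8))):
    """Choose (r, c) maximizing estimated Mflops: the offline dense
    benchmark rate for that block size, divided by the extra work."""
    return max(sizes, key=lambda rc: dense_mflops[rc] / fill_ratio(A, *rc))
```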

Source: Accelerator Cavity Design Problem (Ko via Husbands)

100x100 Submatrix Along Diagonal

Post-RCM Reordering

Effect of Combined RCM+TSP Reordering
[Figure: spy plots — before: green + red; after: green + blue]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)
[Algorithm: the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient
[Algorithm: the k SpMVs are replaced by one call to the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
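For reference, a plain-numpy classical CG with the communication points of a distributed-memory implementation marked in comments (a serial sketch, not the CA variant). CA-CG restructures this loop so that k iterations need one matrix powers call and one global reduction.

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=200):
    """Classical CG; comments describe the parallel cost on p processors."""
    x = x0.copy()
    r = b - A @ x                  # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                     # dot product: global reduction, O(log p) msgs
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV: neighbor communication, every iteration
        alpha = rr / (p @ Ap)      # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r             # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p  # vector update: local, no communication
        rr = rr_new
    return x
```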


[Figure: convergence of CG vs CA-CG (monomial basis) relative to machine precision.
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.
CA-CG shows slower convergence and loss of accuracy due to roundoff;
at s = 16 the monomial basis is rank deficient and the method breaks down]
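The breakdown can be reproduced in a few lines: build the same model problem and watch the condition number of the monomial basis [x, Ax, …, A^s x] grow exponentially with s. This is a hedged illustration; columns are normalized so the growth reflects conditioning rather than scaling.

```python
import numpy as np
import scipy.sparse as sp

# Same model problem as the figure: 2D Poisson, 5-point stencil, 30x30 grid.
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()

rng = np.random.default_rng(0)
x = rng.standard_normal(m * m)

for s in (4, 8, 12, 16):
    K = np.empty((m * m, s + 1))
    v = x / np.linalg.norm(x)
    K[:, 0] = v
    for j in range(1, s + 1):
        v = A @ v
        v /= np.linalg.norm(v)   # normalize: cond reflects direction, not scale
        K[:, j] = v
    print(s, np.linalg.cond(K))  # grows exponentially; ~rank deficient by s = 16
```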


What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as sum of Sᵏ and low-rank matrices

                               Indices
                               Explicit (O(nnz))     Implicit (o(nnz))
Nonzero   Explicit (O(nnz))    CSR and variations    Vision, climate, AMR, …
entries   Implicit (o(nnz))    Graph Laplacian       Stencils
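A small sketch of the sparse-plus-low-rank representation in action: applying A = S + U·D·Vᵀ (and, by composition, A²) without ever forming a dense matrix. All names and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, r = 1000, 5
S = sp.random(n, n, density=1e-3, format='csr', random_state=0)  # sparse part
U = rng.standard_normal((n, r))
D = rng.standard_normal((r, r))                                  # small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x without forming the dense n x n matrix:
    one SpMV plus three skinny products, O(nnz + n*r) work and storage."""
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
y1 = apply_A(x)
y2 = apply_A(y1)   # A^2 x = S^2 x + low-rank corrections, as the slide notes
```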


Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors — same magnitude, opposite signs;
relative error for orthogonal vectors — sign not reproducible.
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• dot products are computed using 1, 2, 3, or 4 threads
• absolute error = maximum – minimum
• relative error = absolute error / maximum absolute value]

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
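A toy demonstration of the problem, and of the "high precision / exact answer" approach in miniature, using Python's math.fsum (which returns the correctly rounded exact sum and is therefore permutation-invariant). The prerounding technique used in practice is different but achieves the same goal 1.

```python
import math
import random

random.seed(0)
xs = [random.uniform(-1, 1) * 10.0 ** random.randint(0, 12)
      for _ in range(10**5)]

naive, exact = set(), set()
for _ in range(5):
    random.shuffle(xs)          # same multiset, different summation order
    naive.add(sum(xs))          # left-to-right FP sum: order-dependent
    exact.add(math.fsum(xs))    # correctly rounded exact sum: order-independent

print(len(naive))   # typically > 1: non-associativity breaks reproducibility
print(len(exact))   # always 1: goal 1 met, at a cost to goal 2
```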

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Successive Band Reduction (Bischof/Lang/Sun)
[Figure: six animation frames of band reduction — an orthogonal transform Q1 eliminates c columns of the band of width b+1, creating a (d+c) x c bulge that Q1ᵀ, Q2, Q2ᵀ, Q3, Q3ᵀ chase down the band; labels b+1, d+1, c, d+c mark block dimensions.
Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

                                                                                                                    1

                                                                                                                    2

                                                                                                                    2

                                                                                                                    3

                                                                                                                    3

                                                                                                                    4

                                                                                                                    4

                                                                                                                    5

                                                                                                                    5

                                                                                                                    Q1

                                                                                                                    Q1T

                                                                                                                    Q2

                                                                                                                    Q2T

                                                                                                                    Q3

                                                                                                                    Q3T

                                                                                                                    Q4

                                                                                                                    Q4T

                                                                                                                    b+1

                                                                                                                    b+1

                                                                                                                    d+1

                                                                                                                    d+1

                                                                                                                    c

                                                                                                                    c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                    Successive Band Reduction (BischofLangSun)

                                                                                                                    1

                                                                                                                    1

                                                                                                                    2

                                                                                                                    2

                                                                                                                    3

                                                                                                                    3

                                                                                                                    4

                                                                                                                    4

                                                                                                                    5

                                                                                                                    5

                                                                                                                    Q5T

                                                                                                                    Q1

                                                                                                                    Q1T

                                                                                                                    Q2

                                                                                                                    Q2T

                                                                                                                    Q3

                                                                                                                    Q3T

                                                                                                                    Q5

                                                                                                                    Q4

                                                                                                                    Q4T

                                                                                                                    b+1

                                                                                                                    b+1

                                                                                                                    d+1

                                                                                                                    d+1

                                                                                                                    c

                                                                                                                    c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                    Successive Band Reduction (BischofLangSun)

                                                                                                                    1

                                                                                                                    1

                                                                                                                    2

                                                                                                                    2

                                                                                                                    3

                                                                                                                    3

                                                                                                                    4

                                                                                                                    4

                                                                                                                    5

                                                                                                                    5

                                                                                                                    6

                                                                                                                    6

                                                                                                                    Q5T

                                                                                                                    Q1

                                                                                                                    Q1T

                                                                                                                    Q2

                                                                                                                    Q2T

                                                                                                                    Q3

                                                                                                                    Q3T

                                                                                                                    Q5

                                                                                                                    Q4

                                                                                                                    Q4T

                                                                                                                    b+1

                                                                                                                    b+1

                                                                                                                    d+1

                                                                                                                    d+1

                                                                                                                    c

                                                                                                                    c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    d+c

                                                                                                                    b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                    Successive Band Reduction (BischofLangSun)

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-avoiding: touch all data once


Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
 – Best sequential speedup vs MKL: 1.9x
 – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
 – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
 – QᵀAQ will be block upper triangular:

       QᵀAQ = [ A11  A12 ]
              [  ε   A22 ]

 – Apply recursively to A11, A22
 – Depends on randomization:
   1. Randomized rank-revealing QR decomposition
   2. Randomized location to try splitting the spectrum
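As a concrete illustration, here is a minimal NumPy/SciPy sketch of one divide-and-conquer split. It builds the spectral projector via a Newton iteration for the matrix sign function and uses column-pivoted QR as the rank-revealing step; the communication-avoiding algorithm of [BDD'11] instead uses implicit repeated squaring and a randomized RRQR, so the function name and structure below are illustrative assumptions, not the actual implementation.

import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, iters=50):
    # Sketch of one spectral divide-and-conquer step (illustrative only).
    # Newton iteration S <- (S + S^-1)/2 converges to sign(A - shift*I)
    # provided no eigenvalue of A has real part equal to shift.
    n = A.shape[0]
    S = A - shift * np.eye(n)
    for _ in range(iters):
        S = 0.5 * (S + np.linalg.inv(S))
    P = 0.5 * (S + np.eye(n))   # projector onto invariant subspace, Re(lambda) > shift
    # Rank-revealing QR of the projector; leading k columns of Q span the subspace.
    # (The CA algorithm uses a randomized RRQR here to avoid pivoting communication.)
    Q, R, _ = qr(P, pivoting=True)
    k = int(np.sum(np.abs(np.diag(R)) > 1e-10 * np.abs(R[0, 0])))
    return Q, k

Usage: B = Q.T @ A @ Q is (nearly) block upper triangular, with the subdiagonal block of size ε; recurse on B[:k, :k] and B[k:, k:].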

Attaining the Lower Bounds: Sequential
(Columns in the original table: #words and #messages, for both the two-level and full memory-hierarchy models. Legend: [Existing], [Ours], [Math-Lib], [Random].)

• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.] (all four cells)
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
• Symmetric Indefinite: [BBDDDPSTY'13]
• LU: [G'97], [T'97], [GDX'11], [BDLST'13]
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Symmetric Eig & SVD: [BDD'11], [BDK'13]
• Nonsymmetric Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)). Legend: [Existing], [Ours], [Math-Lib], [Random]. "Saving factor" = further reduction attainable with extra memory.)

• BLAS-3 – Words: [AGZ'94], [MT'99], [ScaLAPACK]; Messages: [C'69], [vGW'97], [SD'11]; saving factor L: n/P^(1/2)
• Cholesky – [ScaLAPACK], [T'99], [SD'11]; saving factor L: n/P^(1/2)
• Symmetric Indefinite – Words: [BBDDDPSTY'13], [ScaLAPACK]; Messages: [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU – Words: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; Messages: [GDX'11], [T'99], [SD'11]; saving factor L: n/P^(1/2)
• QR – Words: [ScaLAPACK], [DGHL'12], [T'99]; Messages: [DGHL'12], [T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR – [BDD'11], [DGGX'13]
• Symmetric Eig & SVD – Words: [BDD'11], [BDK'13], [ScaLAPACK]; Messages: [BDD'11], [BDK'13]; saving factor L: n/P^(1/2)
• Nonsymmetric Eig – [BDD'11]; saving factor BW: P^(1/2), L: n

Attaining with extra memory: 2.5D algorithms, M = Θ(c·n²/P)

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and the starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication (see the matrix powers sketch below)
 – Assume the matrix is "well-partitioned"
 – Serial implementation:
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
 – Parallel implementation on p processors:
  • Conventional: O(k log p) messages (k SpMV calls, dot products)
  • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability

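The enabling kernel is the "matrix powers kernel," which computes [x, Ax, A²x, …, Aᵏx] with the same communication as one SpMV. Below is a minimal NumPy sketch for a 1D 3-point stencil (tridiagonal A); the block partitioning, ghost-zone width, and function name are illustrative assumptions, and real implementations handle general sparsity and partitions.

import numpy as np
import scipy.sparse as sp

def ca_matrix_powers(A, x, k, nblocks):
    # Compute V = [x, Ax, ..., A^k x] for a tridiagonal (1D 3-point stencil) A (CSR).
    # Each block fetches k "ghost" entries per neighbor ONCE, then performs
    # k purely local SpMVs; entries near the block edges are computed
    # redundantly, which is the price of avoiding k rounds of messages.
    n = A.shape[0]
    V = np.zeros((k + 1, n))
    V[0] = x
    bounds = np.linspace(0, n, nblocks + 1).astype(int)
    for b in range(nblocks):
        lo, hi = bounds[b], bounds[b + 1]
        glo, ghi = max(0, lo - k), min(n, hi + k)   # owned chunk + ghost zone
        Aloc = A[glo:ghi, glo:ghi]                  # local piece of A, read once
        v = x[glo:ghi].astype(float)
        for j in range(1, k + 1):
            v = Aloc @ v    # local SpMV, no communication
            # after j steps only entries >= j away from a cut edge are valid;
            # the owned chunk [lo, hi) stays valid for all j <= k
            V[j, lo:hi] = v[lo - glo : hi - glo]
    return V

# Usage: compare against k repeated global SpMVs on a 1D Poisson matrix
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(1000, 1000)).tocsr()
x = np.random.rand(1000)
V = ca_matrix_powers(A, x, k=4, nblocks=8)
assert np.allclose(V[4], (A ** 4) @ x)

In a distributed-memory code each block is a processor, so the ghost fetch is one message per neighbor instead of k; in the serial case each block of A is read from slow memory once instead of k times.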

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Figure: Speedups on Itanium 2 – The Need for Search. Register-blocking performance profile in Mflops; the reference (unblocked) point is far from the best block size, 4x2.]

[Figure: Register profile, Itanium 2 – performance across block sizes ranges from 190 Mflops to 1190 Mflops.]

[Figure: Register profiles, IBM and Intel IA-64 – four panels labeled Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33 (likely fraction of machine peak); performance ranges roughly 122–252 Mflops (Power3), 459–820 Mflops (Power4), 107–247 Mflops (Itanium 1), and 190 Mflops–1.2 Gflops (Itanium 2).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
 – Actual mflop rate is 1.5² = 2.25x higher
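To make the trade-off concrete, here is a minimal sketch of SpMV in 3x3 Block CSR (BCSR), the register-blocked format described above; the data layout and names are illustrative. Explicit zeros in partially dense 3x3 cells cost extra flops (the fill ratio), but each block needs only one column index instead of up to nine, and the 3x3 multiply can be fully unrolled in tuned C code.

import numpy as np

def bcsr3_spmv(vals, bcol, browptr, x, y):
    # y += A @ x, with A stored as 3x3 BCSR:
    #   vals:    (nblocks, 3, 3) dense blocks, explicit zeros included
    #   bcol:    block-column index of each block
    #   browptr: block-row pointers (length = #block_rows + 1)
    for bi in range(len(browptr) - 1):
        yb = y[3 * bi : 3 * bi + 3]          # view into the output block
        for kk in range(browptr[bi], browptr[bi + 1]):
            bj = bcol[kk]
            yb += vals[kk] @ x[3 * bj : 3 * bj + 3]  # one 3x3 block multiply
    return y

With fill ratio 1.5, the blocked code does 1.5x the flops of CSR, so a 1.5x observed speedup means the achieved mflop rate is 2.25x higher, exactly the arithmetic on the slide above.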

[Figure: spy plot. Source: accelerator cavity design problem (Ko, via Husbands).]

[Figure: 100x100 submatrix along the diagonal.]

[Figure: Post-RCM reordering (spy plot).]

[Figure: Effect of combined RCM+TSP reordering. Before: green + red; after: green + blue.]

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
 – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski
• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
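For reference, here is textbook CG in NumPy (not any particular library's version), with comments marking where a distributed implementation must communicate; note one SpMV and two global reductions per iteration.

import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    x = x0.copy()
    r = b - A @ x                 # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                    # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                # SpMV: neighbor communication, every iteration
        alpha = rr / (p @ Ap)     # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r            # dot product: global reduction
        if rr_new ** 0.5 < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x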

Example: CA-Conjugate Gradient

• The s-step Krylov basis is computed via the CA matrix powers kernel
• One global reduction computes the Gram matrix G
• Local computations within the inner loop require no communication
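A sketch of s-step CA-CG with the monomial basis, following the structure just described: one matrix powers computation and one Gram matrix G = VᵀV per outer iteration replace the s SpMVs and 2s reduction rounds of classical CG, and the inner loop runs entirely in (2s+1)-dimensional coefficient space. The variable names and the dense shift matrix T are illustrative; this is a numerical sketch, not the production algorithm.

import numpy as np

def ca_cg(A, b, x0, s=4, outer=100, tol=1e-8):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for _ in range(outer):
        # --- all communication happens here ---
        P = [p]
        for _ in range(s):
            P.append(A @ P[-1])      # done by the CA matrix powers kernel
        R = [r]
        for _ in range(s - 1):
            R.append(A @ R[-1])
        V = np.stack(P + R, axis=1)  # basis [p, Ap, ..., A^s p, r, ..., A^(s-1) r]
        G = V.T @ V                  # ONE global reduction
        # shift matrix T: the action of A on basis coefficients (monomial basis)
        m = 2 * s + 1
        T = np.zeros((m, m))
        for j in range(s):
            T[j + 1, j] = 1.0
        for j in range(s - 1):
            T[s + 2 + j, s + 1 + j] = 1.0
        # coefficient vectors of p, r, x in the basis V
        pc = np.zeros(m); pc[0] = 1.0
        rc = np.zeros(m); rc[s + 1] = 1.0
        xc = np.zeros(m)
        for _ in range(s):           # --- inner loop: no communication ---
            Tp = T @ pc              # "A @ p" in coefficient space
            alpha = (rc @ G @ rc) / (pc @ G @ Tp)   # dots via precomputed G
            xc = xc + alpha * pc
            rc_new = rc - alpha * Tp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x = x + V @ xc               # map coefficients back to long vectors
        r = V @ rc
        p = V @ pc
        if np.linalg.norm(r) < tol:
            break
    return x

In exact arithmetic this matches classical CG step for step; in floating point the monomial basis becomes ill-conditioned as s grows, which is exactly the stability issue shown next.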

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

[Figure: Convergence of CA-CG (monomial basis) vs CG. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down, while CG converges to machine precision.]

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                          Nonzero entries:
                          Explicit (O(nnz))      Implicit (o(nnz))
Indices Explicit (O(nnz)) CSR and variations     Vision, climate, AMR, …
Indices Implicit (o(nnz)) Graph Laplacian        Stencils
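For the "sum of sparse matrices" case, the point is to apply A = S + UDVᵀ without ever forming the dense sum; a one-line sketch (names illustrative):

def apply_splr(S, U, D, V, x):
    # y = (S + U @ D @ V.T) @ x, keeping the low-rank part factored:
    # one SpMV plus three skinny dense products instead of a dense n x n apply
    return S @ x + U @ (D @ (V.T @ x))

Repeatedly applying this is how Aᵏx stays cheap, which is what the expansion of (S + UDVᵀ)ᵏ into Sᵏ plus low-rank terms formalizes.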

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm, at GNS-MBH
 – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
 – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
 – Most: "What?! How will I debug without reproducibility?"
 – Few: "I know better, and do careful error analysis"
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

[Figure: Intel MKL non-reproducibility. Left panel: absolute error for random vectors (same magnitude, opposite signs). Right panel: relative error for orthogonal vectors (sign not reproducible). Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value.]

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee a fixed reduction tree (fails goals 2 and 3)
 – Use (very) high precision to get the exact answer (fails goal 2)
 – Prerounding technique (Nguyen, D.)
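A simplified sketch of the prerounding idea follows. The actual Nguyen/Demmel algorithm uses several bins and a single-pass reduction; this two-pass, one-bin version is only meant to show why it works: every summand is rounded to a common bit boundary, after which every floating-point addition is exact and the result is independent of summation order.

import math

def reproducible_sum(xs):
    # Pass 1: global max magnitude (a reduction whose result is order-independent)
    M = max(abs(v) for v in xs)
    if M == 0.0:
        return 0.0
    n = len(xs)
    # Choose a power-of-two shift well above n*M, so every partial sum of the
    # rounded values is an exactly representable multiple of ulp(shift).
    shift = 2.0 ** (math.ceil(math.log2(M)) + math.ceil(math.log2(n)) + 2)
    # (v + shift) - shift rounds v to the nearest multiple of ulp(shift)
    rounded = [(v + shift) - shift for v in xs]
    # Pass 2: sum in ANY order; every addition below is exact
    total = 0.0
    for v in rounded:
        total += v
    return total

The answer differs from the true sum by at most about n·ulp(shift), and the user can trade accuracy for speed by choosing more or finer bins, which is goal 4 above.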

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M.

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


[Figure sequence: Successive Band Reduction (Bischof/Lang/Sun) — six bulge-chasing sweeps. Orthogonal transforms Q1…Q5 (applied as Qi·A·QiT) eliminate c columns of the band, creating bulges of size d+c that are chased down the band in numbered steps 1–6. Annotations: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA - SBR

Conventional: touch all data 4 times
Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
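To make the recursion concrete, here is a minimal dense-NumPy sketch of one splitting step. It is an illustration under stated assumptions, not the algorithm of [BDD'11]: the function name split_spectrum is invented, the explicit-inverse Newton iteration for the matrix sign function stands in for the QR-based implicit repeated squaring, and column-pivoted QR stands in for the randomized rank-revealing QR.

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, tol=1e-12, maxit=50):
    """One splitting step of spectral divide-and-conquer (illustrative)."""
    n = A.shape[0]
    I = np.eye(n)
    X = A - shift * I
    for _ in range(maxit):
        # Newton iteration X <- (X + X^{-1})/2 converges to sign(A - shift*I)
        X_new = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(X_new - X, 1) <= tol * np.linalg.norm(X_new, 1):
            X = X_new
            break
        X = X_new
    P = 0.5 * (X + I)               # projector for eigenvalues right of shift
    k = int(round(np.trace(P)))     # dimension of the invariant subspace
    Q, _, _ = qr(P, pivoting=True)  # column-pivoted QR as an RRQR stand-in
    T = Q.T @ A @ Q                 # block upper triangular, up to "ε"
    return Q, T, k                  # recurse on T[:k, :k] and T[k:, k:]
```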

Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

Columns: Two Levels (Words | Messages), Memory Hierarchy (Words | Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc] | [FLPR'99][BDLST'13][MKL etc]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

Columns: Words (BW) | Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = O(c·n²/P)
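For completeness, the 2D figures in the title line are just the memory-dependent lower bounds quoted earlier, specialized to M = Θ(n²/P); a one-line check:

```latex
\#\text{words} = \Omega\!\left(\frac{n^3/P}{M^{1/2}}\right)
  = \Omega\!\left(\frac{n^3/P}{(n^2/P)^{1/2}}\right)
  = \Omega\!\left(\frac{n^2}{P^{1/2}}\right),
\qquad
\#\text{messages} = \Omega\!\left(\frac{n^3/P}{M^{3/2}}\right)
  = \Omega\!\left(\frac{n^3/P}{(n^2/P)^{3/2}}\right)
  = \Omega\!\left(P^{1/2}\right).
```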

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Do k SpMVs with A and a starting vector (see the sketch after this list)
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                                      75
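The k-SpMV pattern above is what the communication-avoiding "matrix powers kernel" reorganizes. Below is a minimal sketch, assuming an arbitrary sparse A in SciPy (the function name matrix_powers is illustrative); it computes the same Krylov block a CA kernel would, but in the conventional way, with one round of data movement per SpMV. The CA version instead partitions A and replicates enough "ghost" data that all k products complete after a single round of communication.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Return V with columns [x, A@x, ..., A^k @ x] (monomial Krylov basis)."""
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]  # one SpMV per step: k rounds of traffic
    return V

# Toy usage on a 1D Laplacian (stand-in for any well-partitioned matrix)
n, k = 8, 3
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
V = matrix_powers(A, np.ones(n), k)
print(V.shape)  # (8, 4)
```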

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profile; reference implementation vs. best block size (4x2), both in Mflops]

79

Register Profile: Itanium 2

[Figure: SpMV performance across register block sizes, ranging from 190 Mflops to 1190 Mflops]

80

Register Profiles: IBM and Intel IA-64

[Figure, four panels of SpMV register-blocking profiles: Power3 – 17% (122–252 Mflops), Power4 – 16% (459–820 Mflops), Itanium 1 – 8% (107–247 Mflops), Itanium 2 – 33% (190 Mflops–1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (i.e., 1.5x more stored values, and 1.5x more flops, than CSR)
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher, since 1.5x more flops are done in 1/1.5 the time

85
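The fill ratio above is easy to measure. Here is a small sketch, assuming SciPy's BSR format as a stand-in for the register-blocked (BCSR) storage used by the tuned kernels: converting CSR to r×c blocks pads partially full blocks with explicit zeros, and the ratio of stored entries to true nonzeros is exactly the "fill ratio" discussed above.

```python
import scipy.sparse as sp

def fill_ratio(A_csr, r, c):
    """Stored entries in r x c blocked (BSR) storage / nonzeros in CSR."""
    B = A_csr.tobsr(blocksize=(r, c))      # pads blocks with explicit zeros
    stored_entries = B.indptr[-1] * r * c  # (#stored blocks) * block area
    return stored_entries / A_csr.nnz

# Toy usage (hypothetical stand-in for the raefsky/ex11 matrices)
A = sp.random(9, 9, density=0.4, format='csr', random_state=0)
print(fill_ratio(A, 3, 3))  # > 1: explicit zeros traded for unrolled blocks
```

Blocking pays off when the speedup from the unrolled r×c multiplies exceeds the fill ratio's extra flops and memory traffic, which is why the best block size must be searched for per matrix and machine.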

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

                                                                                                                      Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR (see the sketch after this list)
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
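As a hypothetical illustration of why "multiple vectors" helps, here is a small Python/scipy sketch (names, sizes, and density are made up): one pass over A serves all 8 right-hand sides, so each matrix entry read from memory is reused 8 times instead of once.

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(10000, 10000, density=1e-3, format="csr", random_state=0)
    X = np.random.default_rng(0).standard_normal((10000, 8))   # 8 vectors

    Y_spmm = A @ X                                   # SpMM: one pass over A
    Y_spmv = np.column_stack([A @ X[:, j] for j in range(8)])  # 8 passes over A
    assert np.allclose(Y_spmm, Y_spmv)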

                                                                                                                      90

                                                                                                                      Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                      91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                      93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
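A minimal dense numpy sketch of classical CG (an illustration under assumptions; the slide shows pseudocode), with comments marking where a distributed-memory implementation would communicate:

    import numpy as np

    def cg(A, b, x, tol=1e-8, maxiter=1000):
        r = b - A @ x               # SpMV: neighbor (ghost-zone) communication
        p = r.copy()
        rr = r @ r                  # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p              # one SpMV per iteration...
            alpha = rr / (p @ Ap)   # ...plus two global reductions per iteration
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r
            if rr_new**0.5 < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 100
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson, SPD
    x = cg(A, np.ones(n), np.zeros(n))
    print(np.linalg.norm(A @ x - np.ones(n)))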

94

Example: CA-Conjugate Gradient

The Krylov basis vectors are computed via the CA matrix powers kernel, and a single global reduction computes the Gram matrix G. Local computations within the inner loop require no communication.
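Below is a minimal dense numpy sketch of s-step CA-CG with the monomial basis (an illustration under assumptions, not the slide's code). Per outer iteration it builds the 2s+1 basis vectors (in a distributed setting this is the matrix powers kernel), forms G = VᵀV with one reduction, then runs s CG steps on short coefficient vectors with no further communication.

    import numpy as np

    def ca_cg(A, b, x, s=4, outer=100, tol=1e-8):
        n, m = len(b), 2*s + 1
        # B applies A to a coefficient vector in the monomial basis
        # V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]
        B = np.zeros((m, m))
        for i in range(s):
            B[i+1, i] = 1.0          # P-block: A * (A^i p) = A^(i+1) p
        for i in range(s-1):
            B[s+2+i, s+1+i] = 1.0    # R-block
        r = b - A @ x
        p = r.copy()
        for _ in range(outer):
            V = np.empty((n, m))     # matrix powers kernel: all SpMV-like
            V[:, 0] = p              # communication happens here
            for i in range(s):
                V[:, i+1] = A @ V[:, i]
            V[:, s+1] = r
            for i in range(s-1):
                V[:, s+2+i] = A @ V[:, s+1+i]
            G = V.T @ V              # one global reduction per outer iteration
            pc = np.zeros(m); pc[0] = 1.0    # p, r, x-x0 as coefficients in V
            rc = np.zeros(m); rc[s+1] = 1.0
            xc = np.zeros(m)
            for _ in range(s):       # s CG steps: local work only
                Bp = B @ pc
                alpha = (rc @ G @ rc) / (pc @ G @ Bp)
                xc = xc + alpha * pc
                rc_new = rc - alpha * Bp
                beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
                pc, rc = rc_new + beta * pc, rc_new
            x, r, p = x + V @ xc, V @ rc, V @ pc   # map coefficients back
            if np.linalg.norm(r) < tol:
                break
        return x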

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                      96

[Convergence plots: CA-CG (monomial basis) vs. CG, with the machine precision level marked. Slower convergence due to roundoff; loss of accuracy due to roundoff. At s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.]

                                                                                                                      97
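The breakdown is easy to reproduce. A sketch under assumptions: the slide's problem is 2D Poisson on a 30x30 grid; for brevity this uses 1D Poisson with n = 30, whose condition number is also roughly 400. The monomial basis [p, Ap, ..., A^s p] aligns with the dominant eigenvector as s grows and loses numerical rank.

    import numpy as np

    n = 30
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson, cond ~ 400
    p = np.random.default_rng(0).standard_normal(n)

    for s in (4, 8, 16):
        V = np.empty((n, s + 1))
        V[:, 0] = p
        for i in range(s):
            V[:, i+1] = A @ V[:, i]                      # monomial basis
        print(f"s={s:2d}  numerical rank={np.linalg.matrix_rank(V)}  "
              f"cond(V)={np.linalg.cond(V):.2e}")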

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square

• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDVᵀ)^k as a sum of S^k and low-rank matrices

                                     Indices
                          Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero    Explicit     CSR and variations    Vision, climate, AMR, …
  entries    (O(nnz))
             Implicit     Graph Laplacian       Stencils
             (o(nnz))
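A small Python/scipy sketch of the "sparse + low rank" case (sizes and names are made up for illustration): A = S + UDVᵀ is applied to a vector without ever forming A, so repeated application, the building block of A^k, costs O(nnz(S) + nk) per vector.

    import numpy as np
    import scipy.sparse as sp

    n, k = 2000, 5
    rng = np.random.default_rng(0)
    S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
    U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
    D = np.diag(rng.standard_normal(k))

    def apply_A(x):
        # A = S + U D V^T, applied without forming the dense n x n matrix A
        return S @ x + U @ (D @ (V.T @ x))

    x = rng.standard_normal(n)
    y = apply_A(apply_A(x))   # (S + UDV^T)^2 x: expands into S^2 x plus
                              # low-rank corrections, all still cheap to apply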

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                      101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

                                                                                                                      103
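The effect is ordinary floating-point non-associativity. A tiny Python stand-in for the MKL multi-thread experiment (the 4-way split plays the role of 4 threads; all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**6)
    y = rng.standard_normal(10**6)

    one_order = np.dot(x, y)
    four_threads = sum(np.dot(xc, yc)                 # per-"thread" partials
                       for xc, yc in zip(np.array_split(x, 4),
                                         np.array_split(y, 4)))
    print(one_order - four_threads)   # typically nonzero; for nearly
                                      # orthogonal x, y even the sign can flip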

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

                                                                                                                      104
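A minimal sketch of the prerounding idea, assuming a single "bin" (real implementations use several bins and handle over/underflow): all processors agree on one power-of-two quantum q derived from a global max, each summand is pre-rounded to a multiple of q, and every subsequent addition is exact, so the answer is bit-identical for any summation order.

    import math
    import numpy as np

    def preround_sum(x, q):
        # Pre-rounding is the ONLY rounding step; with the quantum q chosen
        # below, all later additions are exact, hence order-independent.
        return float(np.sum(np.rint(x / q) * q))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    # One extra global reduction: max|x_i| determines the shared quantum q
    n, M = len(x), float(np.max(np.abs(x)))
    q = 2.0 ** (math.frexp(M)[1] - 52 + math.ceil(math.log2(n)))

    s_serial = preround_sum(x, q)
    s_parallel = sum(preround_sum(c, q) for c in np.array_split(x, 7))
    assert s_serial == s_parallel        # bit-wise identical
    print(abs(s_serial - math.fsum(x)))  # accuracy cost of prerounding: O(n*q)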

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                      Summary

Don't Communic…

                                                                                                                      106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


[Figure text from the Successive Band Reduction (Bischof/Lang/Sun) slides: steps 1-6 create and chase bulges with orthogonal transforms Q1, Q1ᵀ, Q2, Q2ᵀ, …, Q5, Q5ᵀ. b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

                                                                                                                        Conventional vs CA - SBR

Conventional                Communication-Avoiding
Touch all data 4 times      Touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer (see the sketch below)
 – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
 – Q^T·A·Q will be block upper triangular: [ A11 A12 ; ε A22 ] with ‖ε‖ small
 – Apply recursively to A11, A22
 – Depends on randomization:
   1. Randomized Rank-Revealing QR decomposition
   2. Randomized location to try splitting spectrum
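A minimal dense sketch of one splitting step, assuming NumPy/SciPy. The function name `split_spectrum` and the Newton-iteration route to the matrix sign function are illustrative stand-ins: the CA algorithm of [BDD'11] reaches the same spectral projector using only QR factorizations and randomization, avoiding the explicit inverses used here.

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, iters=50):
    # Newton iteration for the matrix sign function of A - shift*I:
    # S converges to +I on eigenvalues with Re > shift, -I otherwise.
    # (The CA algorithm computes the same projector without inverses.)
    n = A.shape[0]
    S = A - shift * np.eye(n)
    for _ in range(iters):
        S = 0.5 * (S + np.linalg.inv(S))
    P = 0.5 * (np.eye(n) + S)          # spectral projector, rank k
    # Rank-revealing QR: leading k columns of Q span the invariant
    # subspace ([BDD'11] uses a *randomized* rank-revealing QR here).
    Q, R, _ = qr(P, pivoting=True)
    k = int(np.sum(np.abs(np.diag(R)) > 1e-8 * np.abs(R[0, 0])))
    T = Q.T @ A @ Q                    # block upper triangular: T[k:, :k] ~ 0
    return Q, T, k                     # recurse on T[:k, :k] and T[k:, k:]
```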

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

Columns: Two Levels of Memory (Words | Messages), then Full Memory Hierarchy (Words | Messages); citation groups listed left to right as on the slide.

BLAS-3: [FLPR'99][BDLST'13][MKL etc.] in all four columns
Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

Columns: Words (BW) | Messages (L) | Saving factor, attained with extra memory (2.5D, M = O(c·n^2/P))

BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation:
   • Conventional: O(k) moves of data from slow to fast memory
   • New: O(1) moves of data – optimal
 – Parallel implementation on p processors:
   • Conventional: O(k log p) messages (k SpMV calls, dot products)
   • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured); see the matrix powers sketch below
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability
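For concreteness, a naive reference for the kernel at the heart of these methods, assuming NumPy; `matrix_powers` is an illustrative name. A real CA implementation returns the same vectors while reading A (plus halo regions, in parallel) only once.

```python
import numpy as np

def matrix_powers(A, x, k):
    # Reference semantics of the matrix powers kernel: [x, Ax, ..., A^k x].
    # This loop reads A k times (k rounds of communication); CA versions
    # produce the same output after reading A and its halo data once.
    V = np.empty((k + 1, x.size))
    V[0] = x
    for j in range(k):
        V[j + 1] = A @ V[j]            # one SpMV per step here
    return V
```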

                                                                                                                        75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

                                                                                                                        77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

• 8x8 dense substructure: exploit this to limit #mem_refs (see the BSR sketch below)

                                                                                                                        78
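A small sketch of exploiting that substructure with SciPy's built-in Block-CSR (BSR) format; the random 4096x4096 matrix is a stand-in, and the point is the storage conversion, not the numbers.

```python
import numpy as np
import scipy.sparse as sp

# Register blocking as a storage choice: store one index per dense r x c
# block instead of one per nonzero, enabling unrolled block kernels.
A_csr = sp.random(4096, 4096, density=1e-3, format='csr')  # stand-in matrix
A_bsr = A_csr.tobsr(blocksize=(8, 8))  # 8x8, as in the raefsky example;
                                       # pads blocks with explicit zeros
x = np.ones(4096)
y = A_bsr @ x                          # SpMV on the blocked structure
```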

Speedups on Itanium 2: The Need for Search

[Figure: Mflops over register block sizes; Reference (1x1) vs. best block size 4x2]

                                                                                                                        79

Register Profile: Itanium 2

[Figure: performance profile over register block sizes, 190 Mflops to 1190 Mflops]

                                                                                                                        80

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking performance profiles, with best fraction of machine peak per panel: Power3 17% (122 to 252 Mflops), Power4 16% (459 to 820 Mflops), Itanium 1 8% (107 to 247 Mflops), Itanium 2 33% (190 Mflops to 1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M

                                                                                                                        82

Zoom in to top corner

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M

                                                                                                                        83

3x3 blocks look natural, but…

• Example: 3x3 blocking
 – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

                                                                                                                        84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
 – Actual mflop rate 1.5^2 = 2.25x higher (see the fill-ratio sketch below)

                                                                                                                        85
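A sketch of how a tuner can measure that trade-off, assuming SciPy; `fill_ratio` is an illustrative helper, and real autotuners (e.g., OSKI) estimate fill from a sample of rows rather than a full conversion.

```python
import scipy.sparse as sp

def fill_ratio(A_csr, r, c):
    # (stored entries, incl. explicit zeros from padding) / (true nonzeros);
    # assumes r and c divide the matrix dimensions. A blocking pays off
    # when the block kernel's raw speedup exceeds this ratio -- e.g. the
    # 3x3 case above: fill ratio 1.5, raw rate 2.25x, net speedup 1.5x.
    B = A_csr.tobsr(blocksize=(r, c))  # BSR nnz counts padded block entries
    return B.nnz / A_csr.nnz
```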

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                        86

                                                                                                                        100x100 Submatrix Along Diagonal

87

                                                                                                                        Post-RCM Reordering

                                                                                                                        88

Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, … (see the RCM sketch below)
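The RCM step of that pipeline is available off the shelf in SciPy; a minimal sketch (the TSP-based refinement from the slides is not shown, and the random matrix is a stand-in):

```python
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Permute rows/columns so nonzeros cluster near the diagonal; the TSP
# ordering on the slides then merges rows with similar sparsity patterns.
A = sp.random(2000, 2000, density=5e-3, format='csr')  # stand-in matrix
A = (A + A.T).tocsr()                  # RCM wants a symmetric pattern
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm, :][:, perm]            # symmetric permutation of A
```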

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…

• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
 – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

                                                                                                                        90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski

• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

(A schematic of the hint/tune/multiply pattern follows.)

                                                                                                                        91
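A schematic of the workflow the slide describes, in Python; these class and method names are NOT OSKI's real C API, just an illustration of the control flow a caller sees: declare expected usage, let the library tune, then call the tuned kernel.

```python
class TunedMatrix:
    # Illustration of the hint/tune/multiply workflow only -- not OSKI's API.
    def __init__(self, A_csr):
        self.A = A_csr
        self.hints = []

    def set_hint(self, kernel, expected_calls):
        self.hints.append((kernel, expected_calls))   # e.g. ('SpMV', 500)

    def tune(self):
        # Run-time tuning: convert the data structure only if enough
        # calls are expected to amortize the conversion cost.
        if any(k == 'SpMV' and n >= 100 for k, n in self.hints):
            self.A = self.A.tobsr(blocksize=(4, 2))   # OSKI picks this by search

    def matmult(self, x):
        return self.A @ x              # tuned kernel used on every call
```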

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                        93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration; a reference implementation follows.

                                                                                                                        94
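A textbook CG in NumPy as a reference point, with the per-iteration communication marked in comments; this is the classical method, not the CA variant.

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, maxit=1000):
    # Classical CG: per iteration, 1 SpMV (neighbor communication) and
    # 2 dot products (global reductions) -- the costs CA-CG amortizes.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rho = r @ r
    for _ in range(maxit):
        Ap = A @ p                     # SpMV
        alpha = rho / (p @ Ap)         # global reduction
        x += alpha * p
        r -= alpha * Ap
        rho_new = r @ r                # global reduction
        if np.sqrt(rho_new) < tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x
```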

Example: CA-Conjugate Gradient

Basis vectors are computed via the CA matrix powers kernel, and one global reduction computes the Gram matrix G; local computations within the inner loop then require no communication (see the sketch below).
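A minimal sketch of the two CA ingredients, assuming NumPy; `gram_dot_demo` is an illustrative name, and the full CA-CG coefficient recurrences are omitted.

```python
import numpy as np

def gram_dot_demo(A, p, r, s):
    # CA-CG structure: one matrix powers kernel builds the 2s+1 basis
    # vectors; ONE global reduction forms G = V^T V; every inner product
    # needed in the next s iterations is then a small local computation.
    cols = [p]
    for _ in range(s):
        cols.append(A @ cols[-1])      # [p, Ap, ..., A^s p]
    cols.append(r)
    for _ in range(s - 1):
        cols.append(A @ cols[-1])      # [r, Ar, ..., A^{s-1} r]
    V = np.stack(cols, axis=1)         # n x (2s+1) basis
    G = V.T @ V                        # the single global reduction
    a = np.random.randn(2 * s + 1)     # coefficients of basis combinations
    b = np.random.randn(2 * s + 1)
    # <V a, V b> computed locally from G -- no further communication:
    assert np.allclose((V @ a) @ (V @ b), a @ G @ b)
    return V, G
```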

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                        96

[Figure: convergence of CA-CG (monomial basis) vs. CG. Slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; dashed line at machine precision. The basis breakdown is reproduced in the sketch below.]

                                                                                                                        97
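The breakdown can be reproduced in a few lines; this sketch uses a diagonal stand-in with the same condition number (≈ 400) as the model problem rather than the actual Poisson matrix.

```python
import numpy as np

# Stand-in for the model problem: diagonal A with cond(A) = 4.0/0.01 = 400.
n = 900
A = np.diag(np.linspace(0.01, 4.0, n))
V = np.empty((n, 17))
v = np.random.randn(n)
V[:, 0] = v / np.linalg.norm(v)
for j in range(16):
    w = A @ V[:, j]                    # monomial basis: next power of A
    V[:, j + 1] = w / np.linalg.norm(w)
for s in (4, 8, 16):                   # conditioning grows rapidly with s;
    print(s, np.linalg.cond(V[:, :s + 1]))  # near-singular by s = 16
```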

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be a sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + UDV^T, D small & square

• Semiseparable matrices arise as preconditioners
 – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices (see the sketch below)

Examples by storage type:
                      Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit:     CSR and variations          Vision, climate, AMR, …
Entries implicit:     Graph Laplacian             Stencils
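A sketch of the representation trick for such matrices, assuming NumPy; `apply_power` is an illustrative name.

```python
def apply_power(S, U, D, V, x, k):
    # Compute A^k x for A = S + U D V^T without forming A densely: each
    # step is one SpMV with S plus thin dense matvecs with U, D, V. This
    # is the structure a matrix powers kernel needs to handle matrices
    # given as 'sparse + low rank'.
    for _ in range(k):
        x = S @ x + U @ (D @ (V.T @ x))
    return x
```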

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                        101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm, at GNS-MBH
 – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
 – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
 – Most: "What?! How will I debug without reproducibility?"
 – Few: "I know better, and do careful error analysis"
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors. Even the sign is not reproducible.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

                                                                                                                        103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy

• Approaches:
 – Guarantee fixed reduction tree (fails goals 2 and 3)
 – Use (very) high precision to get exact answer (fails goal 2)
 – Pre-rounding technique (Nguyen, D.) – sketched below

                                                                                                                        104
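A one-bin sketch of the pre-rounding idea, assuming NumPy; the real ReproBLAS algorithm uses several bins to retain near-full accuracy, while this simplified version deliberately trades accuracy bits for exactness.

```python
import math
import numpy as np

def reproducible_sum(x):
    # 1-bin pre-rounding: after snapping every summand to a common
    # power-of-two grid w, all additions are exact, so the result is
    # independent of summation order. (Ignores under/overflow corners;
    # ReproBLAS recovers accuracy by combining several such bins.)
    x = np.asarray(x, dtype=np.float64)
    M = np.max(np.abs(x))              # one reduction; max is associative
    if M == 0.0:
        return 0.0
    g = math.ceil(math.log2(x.size + 1))
    e = math.frexp(M)[1]               # M < 2**e
    w = math.ldexp(1.0, e - 53 + g)    # grid chosen so no partial sum
    xt = np.round(x / w) * w           #   exceeds 2**53 grid units
    return float(np.sum(xt))           # every addition is now exact

v = np.random.randn(10**6)
assert reproducible_sum(v) == reproducible_sum(v[::-1])  # bitwise equal
```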

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                                                                                        Summary

Don't Communic…

                                                                                                                        106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)



                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          c

                                                                                                                          c

                                                                                                                          b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                          Successive Band Reduction (BischofLangSun)

                                                                                                                          1

                                                                                                                          1

                                                                                                                          2

                                                                                                                          2

                                                                                                                          3

                                                                                                                          3

                                                                                                                          4

                                                                                                                          4

                                                                                                                          5

                                                                                                                          5

                                                                                                                          Q1

                                                                                                                          Q1T

                                                                                                                          Q2

                                                                                                                          Q2T

                                                                                                                          Q3

                                                                                                                          Q3T

                                                                                                                          Q4

                                                                                                                          Q4T

                                                                                                                          b+1

                                                                                                                          b+1

                                                                                                                          d+1

                                                                                                                          d+1

                                                                                                                          c

                                                                                                                          c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                          Successive Band Reduction (BischofLangSun)

                                                                                                                          1

                                                                                                                          1

                                                                                                                          2

                                                                                                                          2

                                                                                                                          3

                                                                                                                          3

                                                                                                                          4

                                                                                                                          4

                                                                                                                          5

                                                                                                                          5

                                                                                                                          Q5T

                                                                                                                          Q1

                                                                                                                          Q1T

                                                                                                                          Q2

                                                                                                                          Q2T

                                                                                                                          Q3

                                                                                                                          Q3T

                                                                                                                          Q5

                                                                                                                          Q4

                                                                                                                          Q4T

                                                                                                                          b+1

                                                                                                                          b+1

                                                                                                                          d+1

                                                                                                                          d+1

                                                                                                                          c

                                                                                                                          c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                          Successive Band Reduction (BischofLangSun)

                                                                                                                          1

                                                                                                                          1

                                                                                                                          2

                                                                                                                          2

                                                                                                                          3

                                                                                                                          3

                                                                                                                          4

                                                                                                                          4

                                                                                                                          5

                                                                                                                          5

                                                                                                                          6

                                                                                                                          6

                                                                                                                          Q5T

                                                                                                                          Q1

                                                                                                                          Q1T

                                                                                                                          Q2

                                                                                                                          Q2T

                                                                                                                          Q3

                                                                                                                          Q3T

                                                                                                                          Q5

                                                                                                                          Q4

                                                                                                                          Q4T

                                                                                                                          b+1

                                                                                                                          b+1

                                                                                                                          d+1

                                                                                                                          d+1

                                                                                                                          c

                                                                                                                          c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          d+c

                                                                                                                          b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                          Successive Band Reduction (BischofLangSun)

Conventional vs. CA-SBR

  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once


Speedups of Symmetric Band Reduction vs. DSBTRD
• Up to 17x on Intel Gainestown vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized rank-revealing QR decomposition
    2. Randomized location to try splitting the spectrum
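Below is a minimal NumPy/SciPy sketch of one split step of spectral divide-and-conquer. It is a simplified stand-in rather than the randomized algorithm referenced above: the invariant-subspace projector is computed by an explicit matrix-sign Newton iteration, and LAPACK's column-pivoted QR stands in for the randomized rank-revealing QR; the function name, shift, and tolerance are illustrative.

import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, iters=40):
    """Split eig(A) across the line Re(z) = shift (illustrative sketch)."""
    n = A.shape[0]
    X = A - shift * np.eye(n)
    for _ in range(iters):
        # Newton iteration X <- (X + inv(X))/2 converges to sign(A - shift*I)
        # when no eigenvalue lies on the splitting line
        X = 0.5 * (X + np.linalg.inv(X))
    P = 0.5 * (np.eye(n) + X)            # spectral projector of rank r
    Q, R, _ = qr(P, pivoting=True)       # rank-revealing QR of the projector
    r = int(np.sum(np.abs(np.diag(R)) > 1e-8 * np.abs(R[0, 0])))
    T = Q.T @ A @ Q                      # block upper triangular: T[r:, :r] is O(eps)
    return Q, T, r                       # recurse on T[:r, :r] and T[r:, r:]

rng = np.random.default_rng(0)
Q, T, r = split_spectrum(rng.standard_normal((8, 8)))
print(r, np.linalg.norm(T[r:, :r]))      # the (2,1) block should be tiny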

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: two-level memory model and full memory hierarchy, counting #words and #messages.

• BLAS-3 – all four cells: [FLPR'99][BDLST'13][MKL etc.]
• Cholesky – words: [G'97][AP'00][LAPACK][BDHS'09]; messages: [G'97][AP'00][BDHS'09]; memory hierarchy (words & messages): [G'97][AP'00][BDHS'09]
• Sym. Indefinite – [BBDDDPSTY'13] (words & messages)
• LU – words: [G'97][T'97][GDX'11][BDLST'13]; messages: [GDX'11][BDLST'13]; memory hierarchy words: [G'97][T'97][BDLST'13]; memory hierarchy messages: [BDLST'13]
• QR – words: [EG'98][FW'03][DGHL'12][BDLST'13]; messages: [FW'03][DGHL'12][BDLST'13]; memory hierarchy words: [EG'98][FW'03][BDLST'13]; memory hierarchy messages: [FW'03][BDLST'13]
• Rank-Revealing QR – [BDD'11][DGGX'13]
• Sym. Eig & SVD – words: [BDD'11][BDK'13]; messages: [BDD'11]
• Nonsym. Eig – [BDD'11] (words & messages)

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring polylog(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3 – words: [AGZ'94][MT'99][ScaLAPACK]; messages: [C'69][vGW'97][SD'11]; saving factor: L (latency) n/P^(1/2)
• Cholesky – [ScaLAPACK][T'99][SD'11] (words & messages); saving factor: L n/P^(1/2)
• Sym. Indefinite – words: [BBDDDPSTY'13][ScaLAPACK]; messages: [BBDDDPSTY'13]; saving factor: L n/P^(1/2)
• LU – words: [ScaLAPACK][GDX'11][T'99][SD'11]; messages: [GDX'11][T'99][SD'11]; saving factor: L n/P^(1/2)
• QR – words: [ScaLAPACK][DGHL'12][T'99]; messages: [DGHL'12][T'99]; saving factor: L n/P^(1/2)
• Rank-Revealing QR – [BDD'11][DGGX'13]
• Sym. Eig & SVD – words: [BDD'11][BDK'13][ScaLAPACK]; messages: [BDD'11][BDK'13]; saving factor: L n/P^(1/2)
• Non-Sym. Eig – [BDD'11] (words & messages); saving factor: BW (bandwidth) P^(1/2), L n

Attaining with extra memory (2.5D): M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data – optimal
  – Parallel implementation on p processors: conventional O(k log p) messages (k SpMV calls, dot products); new O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                                          75
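As a reference for what gets reorganized, here is a naive NumPy/SciPy sketch of the k-step kernel; the loop below does k separate SpMV sweeps over A (k rounds of slow-memory traffic or messages), whereas a communication-avoiding matrix powers kernel produces the same basis with O(1) data movement, at the price of some redundant flops near partition boundaries. Names and the test matrix are illustrative.

import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Return V = [x, A@x, ..., (A^k)@x] as columns (naive reference version)."""
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]   # one sweep over A per step: k communications
    return V

# a tridiagonal 1D Laplacian, a typical "well-partitioned" sparse matrix
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
V = matrix_powers(A, np.ones(100), k=4)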

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                          77

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                                          78
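For reference, a minimal pure-Python CSR SpMV (an illustration, not a tuned kernel) makes the memory-bound nature of the operation visible: every nonzero costs a value load, an index load, and an irregular access to x, which is why exploiting the 8x8 dense substructure to cut index traffic pays off.

import numpy as np

def spmv_csr(rowptr, colind, val, x):
    """y = A @ x for A stored in CSR format."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        for j in range(rowptr[i], rowptr[i + 1]):
            y[i] += val[j] * x[colind[j]]   # value + index + irregular x access
    return y

# A = [[4, 1], [0, 3]] in CSR form
y = spmv_csr([0, 2, 3], [0, 1, 1], [4.0, 1.0, 3.0], [1.0, 2.0])   # [6.0, 6.0]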

Speedups on Itanium 2: The Need for Search
[Figure: register-profile heatmap in Mflops; the reference (unblocked CSR) point and the best block size, 4x2, are marked.]

                                                                                                                          79

Register Profile: Itanium 2
[Figure: heatmap of SpMV performance over register block sizes, ranging from 190 Mflops to 1190 Mflops.]

                                                                                                                          80

Register Profiles: IBM and Intel IA-64
[Figure: four heatmaps of SpMV performance over register block sizes, with best achieved fraction of machine peak: Power3 – 17% (122–252 Mflops), Power4 – 16% (459–820 Mflops), Itanium 1 – 8% (107–247 Mflops), Itanium 2 – 33% (190 Mflops – 1.2 Gflops).]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                                                                                                                          82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                                                                                                                          83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                                          84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate 1.5² = 2.25x higher

                                                                                                                          85
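The same trade-off is easy to see with SciPy (an illustration of the idea, not the tuned kernel; the random test matrix is an assumption for the demo): converting CSR to 3x3 BSR pads partially full blocks with explicit zeros, and the fill ratio measures the extra flops accepted in exchange for dense, unrollable 3x3 block multiplies.

import numpy as np
import scipy.sparse as sp

A = sp.random(90, 90, density=0.05, format="csr", random_state=0)
B = A.tobsr(blocksize=(3, 3))      # 3x3 register blocking, zero-filled blocks
fill_ratio = B.data.size / A.nnz   # stored values (incl. explicit zeros) / true nnz
x = np.ones(90)
assert np.allclose(A @ x, B @ x)   # same product, ~fill_ratio times more flops
print(fill_ratio)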

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                                                                          86

                                                                                                                          100x100 Submatrix Along Diagonal

87

                                                                                                                          Post-RCM Reordering

                                                                                                                          88

                                                                                                                          Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                                                          90

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                          91
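The run-time half of this tuning can be sketched in plain SciPy (this mimics the idea, not OSKI's actual C API; the function and its parameters below are made up for illustration): try candidate register-block sizes on the user's matrix, time each, and keep the winner.

import time
import numpy as np
import scipy.sparse as sp

def tune_spmv(A, blocksizes=((1, 1), (2, 2), (3, 3), (4, 2), (8, 8)), trials=20):
    """Pick the fastest blocked format for this matrix on this machine."""
    x = np.ones(A.shape[1])
    best_t, best = float("inf"), None
    for r, c in blocksizes:
        B = A.tocsr() if (r, c) == (1, 1) else A.tobsr(blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            B @ x                  # the kernel being tuned
        t = time.perf_counter() - t0
        if t < best_t:
            best_t, best = t, (B, (r, c))
    return best

A = sp.random(96, 96, density=0.05, format="csr", random_state=0)
B, blk = tune_spmv(A)              # block sizes must divide the matrix shape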

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                          93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
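As a concrete reference, here is textbook CG in NumPy with the communication points marked in comments, assuming A is distributed so that an SpMV needs neighbor exchanges and each dot product is a global reduction.

import numpy as np

def cg(A, b, x, maxiter=200, tol=1e-8):
    r = b - A @ x                  # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                     # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV: neighbor communication
        alpha = rr / (p @ Ap)      # dot product: global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r             # dot product: global reduction
        if np.sqrt(rr_new) <= tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x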

(In CA-CG, below: the SpMVs are replaced by one call to the CA matrix powers kernel, and a single global reduction computes the Gram matrix G.)

                                                                                                                          94

Example: CA-Conjugate Gradient

Local computations within the inner loop require no communication.
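A sketch of the communication skeleton of one outer CA-CG iteration, under the same assumptions (the coefficient recurrences of the full algorithm are omitted, so this shows only where communication happens):

import numpy as np

def ca_cg_setup(A, p, r, s):
    """Basis + Gram matrix for s steps of CA-CG (monomial basis)."""
    cols = [p]
    for _ in range(s):
        cols.append(A @ cols[-1])  # matrix powers kernel: [p, Ap, ..., A^s p]
    cols.append(r)
    for _ in range(s - 1):
        cols.append(A @ cols[-1])  # and [r, Ar, ..., A^(s-1) r]
    V = np.column_stack(cols)
    G = V.T @ V                    # ONE global reduction per s iterations;
    return V, G                    # the inner updates use only small vectors
                                   # of coefficients against G: no communication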

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                          96

[Figure: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG (monomial) shows slower convergence and loss of accuracy due to roundoff, relative to CG and machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

                                                                                                                          97
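The breakdown is easy to reproduce. A small NumPy/SciPy experiment on the slide's model problem (the random starting vector is an assumption for the demo) shows the condition number of the monomial basis exploding as s grows, consistent with the rank deficiency at s = 16 seen in the plot.

import numpy as np
import scipy.sparse as sp

n = 30                              # 2D Poisson, 5-point stencil, 30x30 grid
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])
cols = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ cols[-1]
    cols.append(w / np.linalg.norm(w))          # monomial basis, normalized
    if s in (4, 8, 16):
        print(s, np.linalg.cond(np.column_stack(cols)))
# cond(V) grows exponentially with s, approaching 1/eps near s = 16:
# the basis becomes numerically rank deficient and the method breaks down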

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Examples, by how the nonzero entries and the indices are represented:

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz))    CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz))    Graph Laplacian              Stencils
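For the A = S + U·D·Vᵀ example, the point is that A can still be applied, and powers of A built, in o(n²) work; a minimal sketch under assumed shapes:

import numpy as np
import scipy.sparse as sp

def apply_sum(S, U, D, V, x):
    """y = (S + U @ D @ V.T) @ x without forming the dense n-by-n matrix:
    O(nnz(S) + n*k) work instead of O(n^2)."""
    return S @ x + U @ (D @ (V.T @ x))

n, k = 1000, 5
rng = np.random.default_rng(0)
S = sp.random(n, n, density=0.01, format="csr", random_state=0)
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))    # small and square, as on the slide
y = apply_sum(S, U, D, V, np.ones(n))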

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                          101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign is not reproducible.]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

                                                                                                                          103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (sacrifices goals 2 and 3)
  – Use (very) high precision to get the exact answer (sacrifices goal 2)
  – Prerounding technique (Nguyen, D.)

                                                                                                                          104
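A tiny Python illustration of both the problem and the "exact answer" approach: floating-point addition is not associative, so plain sums depend on summand order, while math.fsum returns the exactly rounded sum for any order (standing in here for the high-precision approach; it is not the prerounding technique above).

import math
import random

random.seed(0)
xs = [random.uniform(-1, 1) * 10.0 ** random.randint(0, 12)
      for _ in range(100_000)]
ys = sorted(xs)                        # same summands, different order

print(sum(xs) == sum(ys))              # usually False: rounding differs by order
print(math.fsum(xs) == math.fsum(ys))  # True: exactly rounded, order-independent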

Performance results on 1024 procs of Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                          Summary

Don't Communic…

                                                                                                                          106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


[Figure, repeated over several animation steps: Successive Band Reduction (Bischof/Lang/Sun). Orthogonal transforms Q1, Q1^T, Q2, Q2^T, … chase bulges through a symmetric band matrix in numbered sweeps 1–6; labels: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

                                                                                                                            Conventional vs CA - SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs. DSBTRD
• Up to 17x on Intel Gainestown vs. MKL 10.0 (n = 12000, b = 500, 8 threads)
• Up to 12x on Intel Westmere vs. MKL 10.3 (n = 12000, b = 200, 10 threads)
• Up to 25x on AMD Budapest vs. ACML 4.4 (n = 9000, b = 500, 4 threads)
• Up to 30x on AMD Magny-Cours vs. ACML 4.4 (n = 12000, b = 500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
 – Best sequential speedup vs. MKL: 1.9x
 – Best sequential speedup vs. ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
 – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
 – Then Q^T A Q = [ A11  A12 ; ε  A22 ] is block upper triangular, with ε ≈ 0
 – Apply recursively to A11, A22
 – Depends on randomization:
  1. Randomized rank-revealing QR decomposition
  2. Randomized location to try splitting the spectrum
A sketch of one splitting step appears below.
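As a loose illustration only: the slides rely on a randomized, inverse-free iteration plus randomized rank-revealing QR, but the splitting step can be sketched with the classical Newton iteration for the matrix sign function and column-pivoted QR. split_spectrum, its shift, and the tolerances below are hypothetical choices.

    import numpy as np
    from scipy.linalg import qr

    def split_spectrum(A, shift=0.0, iters=50, tol=1e-10):
        """One splitting step of spectral divide-and-conquer (toy version).
        Splits the spectrum of A at the vertical line Re(lambda) = shift."""
        n = A.shape[0]
        S = A - shift * np.eye(n)
        for _ in range(iters):                 # Newton iteration: S -> sign(S)
            S = 0.5 * (S + np.linalg.inv(S))
        P = 0.5 * (np.eye(n) + S)              # projector onto invariant subspace
        Q, R, _ = qr(P, pivoting=True)         # rank-revealing (column-pivoted) QR
        k = int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0])))
        T = Q.T @ A @ Q                        # block upper triangular: T[k:, :k] ~ 0
        return Q, T, k                         # recurse on T[:k, :k] and T[k:, k:]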

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of Memory (#Words, #Messages); Memory Hierarchy (#Words, #Messages)
• BLAS-3 – [FLPR'99] [BDLST'13] [MKL etc.] in all four columns
• Cholesky – #Words: [G'97] [AP'00] [LAPACK] [BDHS'09]; remaining columns: [G'97] [AP'00] [BDHS'09]
• Sym. Indefinite – [BBDDDPSTY'13] in all four columns
• LU – #Words: [G'97] [T'97] [GDX'11] [BDLST'13]; #Messages: [GDX'11] [BDLST'13]; hierarchy: [G'97] [T'97] [BDLST'13] and [BDLST'13]
• QR – #Words: [EG'98] [FW'03] [DGHL'12] [BDLST'13]; #Messages: [FW'03] [DGHL'12] [BDLST'13]; hierarchy: [EG'98] [FW'03] [BDLST'13] and [FW'03] [BDLST'13]
• Rank-Revealing QR – [BDD'11] [DGGX'13]
• Sym. Eig & SVD – #Words: [BDD'11] [BDK'13]; #Messages: [BDD'11]
• Non-Sym. Eig – [BDD'11] in both

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words (BW); #Messages (L); saving factor with extra memory
• BLAS-3 – [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving: L: n/P^(1/2)
• Cholesky – [ScaLAPACK] [T'99] [SD'11]; saving: L: n/P^(1/2)
• Sym. Indefinite – #Words: [BBDDDPSTY'13] [ScaLAPACK]; #Messages: [BBDDDPSTY'13]; saving: L: n/P^(1/2)
• LU – #Words: [ScaLAPACK] [GDX'11] [T'99] [SD'11]; #Messages: [GDX'11] [T'99] [SD'11]; saving: L: n/P^(1/2)
• QR – #Words: [ScaLAPACK] [DGHL'12] [T'99]; #Messages: [DGHL'12] [T'99]; saving: L: n/P^(1/2)
• Rank-Revealing QR – [BDD'11] [DGGX'13]
• Sym. Eig & SVD – #Words: [BDD'11] [BDK'13] [ScaLAPACK]; #Messages: [BDD'11] [BDK'13]; saving: L: n/P^(1/2)
• Non-Sym. Eig – [BDD'11]; saving: BW: P^(1/2), L: n
Attaining with extra memory (2.5D): M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                            Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and a starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
 – Parallel implementation on p processors
  • Conventional: O(k log p) messages (k SpMV calls, dot products)
  • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
 – Challenges: poor partitioning, preconditioning, numerical stability
(A sketch of the "one communication for k steps" idea follows, for a 1D stencil.)

                                                                                                                            75
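A minimal sketch of this idea, assuming A is the 1D 3-point averaging stencil so each processor owns a contiguous slice of x: after one up-front exchange of k ghost values per side (instead of k exchanges of one value), all k products A^j x can be computed locally. ca_stencil_powers and its argument names are illustrative.

    import numpy as np

    def ca_stencil_powers(x_local, left_ghosts, right_ghosts, k):
        """Compute this processor's entries of A x, A^2 x, ..., A^k x for the
        stencil (A x)_i = (x_{i-1} + x_i + x_{i+1}) / 3, after ONE communication
        that fetched k ghost values from each neighbor (arrays of length k)."""
        n = len(x_local)
        v = np.concatenate([left_ghosts, x_local, right_ghosts])  # length n + 2k
        basis = []
        for j in range(1, k + 1):
            v = (v[:-2] + v[1:-1] + v[2:]) / 3.0       # apply A; valid region shrinks by 2
            basis.append(v[k - j : k - j + n].copy())  # owned entries of A^j x
        return basis

The shrinking ghost region is the redundant computation mentioned above: entries near the boundary are recomputed by both neighbors in exchange for fewer messages.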

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                            77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                                            78
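For reference, here is a plain CSR SpMV kernel, the baseline the tunings below are measured against; in practice the loop is written in C, but the access pattern is the same. csr_spmv and its argument names are illustrative.

    import numpy as np

    def csr_spmv(val, col, ptr, x):
        """Baseline CSR SpMV, y = A x. The indirect loads x[col[k]] and the
        short, irregular inner loops are why untuned SpMV runs far below peak."""
        n_rows = len(ptr) - 1
        y = np.zeros(n_rows)
        for i in range(n_rows):
            s = 0.0
            for k in range(ptr[i], ptr[i + 1]):
                s += val[k] * x[col[k]]
            y[i] = s
        return y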

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profiles; Reference (Mflops) vs. Best, 4x2 blocking (Mflops)]

                                                                                                                            79

Register Profile: Itanium 2
[Figure: register-blocking profile, ranging from 190 Mflops to 1190 Mflops]

                                                                                                                            80

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles on four machines: Power3 (122–252 Mflops), Power4 (459–820 Mflops), Itanium 1 (107–247 Mflops), Itanium 2 (190 Mflops–1.2 Gflops)]

                                                                                                                            Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                                            82

                                                                                                                            Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                                            83

3x3 blocks look natural, but…
• Example: 3x3 blocking
 – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                                            84

                                                                                                                            Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
 – Actual Mflop rate 1.5² = 2.25x higher
(A sketch of the blocked kernel follows.)

                                                                                                                            85
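Here is a sketch of the register-blocked (BCSR) kernel the slide describes, with explicit zeros filled in so every stored block is a dense 3x3; bsr_spmv_3x3 and its storage layout are illustrative (SciPy's bsr_matrix implements the same idea in C).

    import numpy as np

    def bsr_spmv_3x3(blocks, block_col, row_ptr, x):
        """3x3 register-blocked SpMV, y = A x.
        blocks:    (nnzb, 3, 3) dense blocks, explicit zeros included
        block_col: first column index of each block
        row_ptr:   CSR-style pointers over block rows"""
        nbrows = len(row_ptr) - 1
        y = np.zeros(3 * nbrows)
        for I in range(nbrows):
            acc = np.zeros(3)
            for b in range(row_ptr[I], row_ptr[I + 1]):
                j = block_col[b]
                acc += blocks[b] @ x[j:j + 3]   # extra flops on filled-in zeros,
            y[3 * I : 3 * I + 3] = acc          # but regular, unrollable access
        return y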

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                            86

                                                                                                                            100x100 Submatrix Along Diagonal

87

                                                                                                                            Post-RCM Reordering

                                                                                                                            88

                                                                                                                            Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …
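The RCM half of this reordering is available off the shelf; a minimal sketch using SciPy follows (the TSP-based ordering from the slide is not shown, and the random matrix is just hypothetical test data).

    from scipy.sparse import random as sparse_random
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def rcm_reorder(A_csr):
        # RCM pulls nonzeros toward the diagonal, creating the dense
        # substructure that register blocking can then exploit.
        perm = reverse_cuthill_mckee(A_csr, symmetric_mode=True)
        return A_csr[perm, :][:, perm]

    # Toy usage on a random symmetric pattern (hypothetical data):
    B = sparse_random(1000, 1000, density=0.005, format='csr')
    A = (B + B.T).tocsr()          # symmetrize so RCM's assumption holds
    A_rcm = rcm_reorder(A)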

                                                                                                                            Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

                                                                                                                            90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                            91
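OSKI itself is a C library, but the heart of its off-line/run-time tuning can be sketched as a search: benchmark the user's matrix on the actual machine over candidate block sizes and keep the winner. A toy stand-in for what OSKI automates; names and sizes are illustrative:

import time
import numpy as np
import scipy.sparse as sp

# Placeholder matrix; n = 24000 is divisible by all candidate block sizes.
A = sp.random(24000, 24000, density=5e-4, format="csr", random_state=0)
x = np.ones(A.shape[1])

def bench(M, reps=20):
    """Wall-clock time for `reps` SpMVs with matrix M."""
    t0 = time.perf_counter()
    for _ in range(reps):
        M @ x
    return time.perf_counter() - t0

candidates = {(1, 1): A}                       # unblocked CSR baseline
for r, c in [(2, 2), (3, 3), (4, 4), (2, 4)]:
    candidates[(r, c)] = A.tobsr(blocksize=(r, c))

best = min(candidates, key=lambda rc: bench(candidates[rc]))
print("best block size for this matrix/machine:", best)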

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                            93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

                                                                                                                            94
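A sequential sketch of classical CG (standard textbook form, not code from the talk); the comments mark where each iteration of a distributed-memory run would communicate:

import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    x = x0.copy()
    bnorm = np.linalg.norm(b)
    r = b - A @ x                  # SpMV: halo exchange with neighbors
    p = r.copy()
    rr = r @ r                     # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV: communication every iteration
        alpha = rr / (p @ Ap)      # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r             # dot product: global reduction
        if np.sqrt(rr_new) <= tol * bnorm:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x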

Example: CA-Conjugate Gradient

The s SpMVs of s classical iterations are computed at once via the CA matrix powers kernel, and one global reduction computes the Gram matrix G; local computations within the inner loop require no communication.
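A sketch of the two communication-avoiding ingredients only (not a full CA-CG, which also needs careful coefficient updates):

import numpy as np

def matrix_powers(A, v, s):
    """V = [v, Av, ..., A^s v]. A CA implementation computes all s+1
    vectors with one round of ghost-zone communication instead of s
    separate SpMVs; here it is written sequentially."""
    V = np.empty((v.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

# One global reduction: G = V^T V supplies every inner product the next
# s CG steps need, so the inner loop touches only small s-sized vectors.
# G = V.T @ V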

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                            96

CA-CG (monomial) vs CG
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400
[Convergence plot: CA-CG with the monomial basis shows slower convergence and loss of accuracy due to roundoff, while CG converges to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
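The breakdown is easy to reproduce with a small check consistent with the model problem (exact numbers will vary): the monomial basis vectors all turn toward the dominant eigenvector, so the basis loses numerical rank as s grows.

import numpy as np

n = 30
T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson
A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))    # 2D 5-point stencil
v = np.ones(n*n) / np.sqrt(n*n)

for s in (4, 8, 16):
    V = np.empty((n*n, s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    print(f"s={s:2d}  rank={np.linalg.matrix_rank(V)}  "
          f"cond={np.linalg.cond(V):.1e}")
# cond(V) grows exponentially with s; in this setup it approaches 1/eps
# near s = 16 and the computed basis is numerically rank deficient.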

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Nonzero entries \ Indices    Explicit (O(nnz))      Implicit (o(nnz))
Explicit (O(nnz))            CSR and variations     Vision, climate, AMR, …
Implicit (o(nnz))            Graph Laplacian        Stencils
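For the sparse-plus-low-rank case, the point is to apply A (and its powers) without ever forming a dense matrix; a minimal sketch with placeholder sizes:

import numpy as np
import scipy.sparse as sp

n, r = 10000, 5
S = sp.random(n, n, density=1e-4, format="csr", random_state=0)  # sparse part
rng = np.random.default_rng(0)
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))        # r x r, small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x using O(nnz + n*r) work, never forming A."""
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
y2 = apply_A(apply_A(x))    # A^2 x: powers stay in the o(n^2) representation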

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                            101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

[Plots: absolute error for random vectors — same magnitude, opposite signs; relative error for orthogonal vectors — sign not reproducible.]

103
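The mechanism is ordinary floating-point non-associativity; even without MKL or threads, just changing the summation order changes the bits (a minimal illustration, not the MKL experiment itself):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)

s_fwd = np.sum(x)          # one summation order
s_rev = np.sum(x[::-1])    # same data, reversed order
print(s_fwd == s_rev)      # typically False
print(s_fwd - s_rev)       # difference is pure rounding, order-dependent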

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

                                                                                                                            104
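A one-level sketch of the prerounding idea (simplified from the Demmel/Nguyen technique; real ReproBLAS uses several bins to preserve accuracy): truncate every summand to a common bit boundary chosen from max|xᵢ| and n, after which the truncated values add exactly and any summation order gives identical bits.

import math

def reproducible_sum(xs):
    n = len(xs)
    m = max(abs(v) for v in xs)
    if m == 0.0:
        return 0.0
    # Common boundary: a power of two > n * max|x_i|, so partial sums of
    # the truncated values never round (they share one ulp and fit in
    # 53 bits regardless of order).
    _, e = math.frexp(m)                      # m = f * 2**e, 0.5 <= f < 1
    C = math.ldexp(1.0, e + math.ceil(math.log2(n)) + 1)
    total = 0.0
    for v in xs:
        hi = (v + C) - C                      # v rounded to the boundary
        total += hi                           # exact addition
    return total

Every hi is a multiple of a power of two fixed by the data, not by the schedule, so the result is independent of the reduction tree; the accuracy of this one-level version is limited to roughly n·ulp(C), which is why the full method layers several bins.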

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106


Successive Band Reduction (Bischof/Lang/Sun)
[Figure, four animation frames: orthogonal sweeps Q1, Q1ᵀ, …, Q5, Q5ᵀ chase the bulge down the band through steps 1–6; b = bandwidth, c = #columns, d = #diagonals, constraint: c + d ≤ b]

Conventional vs CA – SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
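A quick numerical check of the structure being exploited, using SciPy's Schur factorization to supply an invariant subspace (this verifies the block form; it is not the randomized CA algorithm):

import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))

T, Q = schur(A, output="real")    # leading k Schur vectors span an
k = 4                             # invariant subspace of A
if abs(T[k, k - 1]) > 0:          # don't split a 2x2 (complex-pair) block
    k += 1

B = Q.T @ A @ Q
print(np.linalg.norm(B[k:, :k]))  # ~1e-15: the "epsilon" (2,1) block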

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: #words and #messages, for a two-level memory and for the full memory hierarchy.

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] — same references for both memory models
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] (two levels); [G'97][AP'00][BDHS'09] (hierarchy, words and messages)
• Sym. Indefinite: [BBDDDPSTY'13] (both memory models)
• LU: words [G'97][T'97][GDX'11][BDLST'13], messages [GDX'11][BDLST'13] (two levels); words [G'97][T'97][BDLST'13], messages [BDLST'13] (hierarchy)
• QR: words [EG'98][FW'03][DGHL'12][BDLST'13], messages [FW'03][DGHL'12][BDLST'13] (two levels); words [EG'98][FW'03][BDLST'13], messages [FW'03][BDLST'13] (hierarchy)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] and [BDD'11]
• Nonsym. Eig: [BDD'11] (words and messages)

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; the bounds are words = Ω(n^2/P^(1/2)) and messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: Words (BW) and Messages (L) [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Symmetric Indefinite: Words [BBDDDPSTY'13][ScaLAPACK]; Messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: Words [ScaLAPACK][GDX'11][T'99][SD'11]; Messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: Words [ScaLAPACK][DGHL'12][T'99]; Messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Symmetric Eig & SVD: Words [BDD'11][BDK'13][ScaLAPACK]; Messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Nonsymmetric Eig: Words [BDD'11]; Messages [BDD'11]; saving factors BW: P^(1/2), L: n

Attaining with extra memory ("2.5D"): M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax = b or Ax = λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data, which is optimal (see the matrix powers sketch below)
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages, which is optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                                              75
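To make the O(1)-data-movement claim concrete, here is a minimal sketch (mine, not from the talk) of the matrix powers kernel for a 1D 3-point stencil: after importing k ghost values per neighbor up front, all of Ax, …, A^k x follow from one pass over the local block, with some redundant flops at the boundaries, exactly the "price" noted above.

#include <string.h>

/* Matrix powers kernel sketch for the 1D Poisson stencil
   (Ax)_i = 2*x[i] - x[i-1] - x[i+1].  x_ext holds the m owned entries
   plus k ghost values on each side (fetched with ONE message per
   neighbor, instead of one message per SpMV).  V is (k+1) rows by
   (m+2k) columns, caller-allocated; row j ends up holding A^j x,
   valid on the owned index range [k, k+m). */
void matrix_powers_1d(const double *x_ext, int m, int k, double *V)
{
    int w = m + 2 * k;                            /* extended width incl. ghosts */
    memcpy(V, x_ext, (size_t)w * sizeof(double)); /* row 0 = A^0 x */
    for (int j = 1; j <= k; j++) {
        const double *prev = V + (size_t)(j - 1) * w;
        double       *cur  = V + (size_t)j * w;
        for (int i = j; i < w - j; i++)  /* one valid point lost per side, per level */
            cur[i] = 2.0 * prev[i] - prev[i - 1] - prev[i + 1];
    }
}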

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                              77

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                                              78
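For reference, the kernel being tuned: a plain CSR SpMV in C (a generic sketch, not code from the talk). The indexed gather from x is why performance depends so strongly on the nonzero structure:

/* Baseline CSR SpMV: y = A*x.  One indexed load of x per nonzero and
   ~2 flops per nonzero, so the kernel is memory-bound; dense
   substructure (e.g. raefsky's 8x8 blocks) can be exploited to cut
   index traffic. */
void spmv_csr(int n, const int *rowptr, const int *colidx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = 0.0;
        for (int jj = rowptr[i]; jj < rowptr[i + 1]; jj++)
            yi += val[jj] * x[colidx[jj]];  /* irregular access to x */
        y[i] = yi;
    }
}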

Speedups on Itanium 2: The Need for Search
(Figure: Mflops achieved for every register block size; the reference implementation vs. the best block size found by search, 4x2.)

                                                                                                                              79

Register Profile: Itanium 2
(Figure: Mflops for all register block sizes, ranging from 190 Mflops to 1190 Mflops.)

                                                                                                                              80

Register Profiles: IBM and Intel IA-64
(Figure: best register blocking as a fraction of machine peak on each platform; Power3: 17% (122–252 Mflops), Power4: 16% (459–820 Mflops), Itanium 1: 8% (107–247 Mflops), Itanium 2: 33% (190 Mflops–1.2 Gflops).)

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

                                                                                                                              82

Zoom in to top corner
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

                                                                                                                              83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                                              84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher

                                                                                                                              85
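A minimal sketch (mine) of the register-blocked kernel the slide describes: 3x3 BCSR stores one column index per block, fills in explicit zeros, and fully unrolls the block multiply, so the 1.5x extra flops buy roughly a 9x reduction in index traffic.

#include <stddef.h>

/* 3x3 register-blocked (BCSR) SpMV: nb block rows, one column index
   per 3x3 block (stored dense, row-major, explicit zeros filled in).
   The unrolled block multiply keeps the y values in registers and
   issues 1 index load per 9 matrix values. */
void spmv_bcsr_3x3(int nb, const int *browptr, const int *bcolidx,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < nb; I++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;
        for (int jj = browptr[I]; jj < browptr[I + 1]; jj++) {
            const double *b  = bval + 9 * (size_t)jj;
            const double *xp = x + 3 * (size_t)bcolidx[jj];
            y0 += b[0]*xp[0] + b[1]*xp[1] + b[2]*xp[2];
            y1 += b[3]*xp[0] + b[4]*xp[1] + b[5]*xp[2];
            y2 += b[6]*xp[0] + b[7]*xp[1] + b[8]*xp[2];
        }
        y[3*I + 0] = y0;  y[3*I + 1] = y1;  y[3*I + 2] = y2;
    }
}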

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                                                                              86

                                                                                                                              100x100 Submatrix Along Diagonal

87

                                                                                                                              Post-RCM Reordering

                                                                                                                              88

                                                                                                                              Effect of Combined RCM+TSP Reordering

(Figure: nonzeros before = green + red; after = green + blue.)

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR (see the SpMM sketch below)
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                                                              90
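The multiple-vector gain comes from reading each element of A once per nv vectors instead of once per vector, amortizing the memory traffic that dominates single-vector SpMV. A minimal CSR SpMM sketch (mine):

#include <stddef.h>

/* SpMM: Y = A*X for nv right-hand sides at once (X, Y row-major,
   n x nv).  Each val/colidx entry is loaded once and used nv times;
   this reuse is the source of the ~7x gains cited above. */
void spmm_csr(int n, int nv, const int *rowptr, const int *colidx,
              const double *val, const double *X, double *Y)
{
    for (int i = 0; i < n; i++) {
        for (int v = 0; v < nv; v++) Y[(size_t)i * nv + v] = 0.0;
        for (int jj = rowptr[i]; jj < rowptr[i + 1]; jj++) {
            double a = val[jj];
            const double *xr = X + (size_t)colidx[jj] * nv;
            for (int v = 0; v < nv; v++)
                Y[(size_t)i * nv + v] += a * xr[v];
        }
    }
}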

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                              91
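For flavor, the OSKI calling sequence looks roughly like this (paraphrased from the OSKI documentation; treat names and argument orders as approximate and check them against the release at bebop.cs.berkeley.edu/oski, since index/value types are simplified here):

#include <oski/oski.h>

/* Sketch: tune SpMV for one matrix, then reuse the tuned kernel.
   The workload hint plus an explicit tune call let OSKI decide
   whether run-time tuning will pay for itself. */
void oski_example(int *Aptr, int *Aind, double *Aval, int m, int n,
                  double *x, double *y)
{
    oski_Init();
    oski_matrix_t A = oski_CreateMatCSR(Aptr, Aind, Aval, m, n,
                                        SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, m, STRIDE_UNIT);

    /* Hint: ~500 SpMVs coming with these operands. */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, 500);
    oski_TuneMat(A);

    for (int it = 0; it < 500; it++)
        oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);  /* y = A*x, tuned */

    oski_DestroyMat(A);
    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_Close();
}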

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                              93

Example: Classical Conjugate Gradient (CG)
(Figure: CG pseudocode; the SpMV and the dot products require communication in each iteration. A minimal reference implementation follows below.)

                                                                                                                              94
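For orientation, a minimal serial CG in C (a generic textbook sketch, not the slide's figure; the Amul callback and all names are mine), with comments marking where a parallel implementation communicates:

#include <math.h>
#include <stdlib.h>

typedef void (*spmv_fn)(const double *x, double *y, void *ctx);

static double dot(const double *u, const double *v, int n)
{   /* parallel version: local partial sum + one global reduction */
    double s = 0.0;
    for (int i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

/* Classical CG for s.p.d. A, applied via Amul.  Per iteration:
   1 SpMV (neighbor communication) + 2 dot products (global
   reductions), hence O(k log p) messages for k iterations. */
int cg(spmv_fn Amul, void *ctx, const double *b, double *x,
       int n, int maxit, double tol)
{
    double *r = malloc(n * sizeof *r);
    double *p = malloc(n * sizeof *p);
    double *w = malloc(n * sizeof *w);
    Amul(x, w, ctx);                        /* SpMV: neighbor communication */
    for (int i = 0; i < n; i++) p[i] = r[i] = b[i] - w[i];
    double rr = dot(r, r, n);               /* global reduction */
    int it;
    for (it = 0; it < maxit && sqrt(rr) > tol; it++) {
        Amul(p, w, ctx);                    /* SpMV: neighbor communication */
        double alpha = rr / dot(p, w, n);   /* global reduction */
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * w[i]; }
        double rr_new = dot(r, r, n);       /* global reduction */
        double beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(w);
    return it;
}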

Example: CA-Conjugate Gradient
(Figure: CA-CG pseudocode; the s-step Krylov basis is computed via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication. See the Gram-matrix sketch below.)
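One building block of that reorganization, sketched under the assumption of MPI (function name and data layout are mine): for s steps, all inner products are served by a single Gram-matrix reduction G = V^T V, after which any inner product <Va, Vb> = a^T G b is a local, communication-free operation.

#include <stdlib.h>
#include <mpi.h>

/* G = V^T V for the local rows of the Krylov basis V (nlocal x m,
   row-major, m ~ 2s+1).  ONE MPI_Allreduce replaces the ~2s separate
   dot-product reductions of s classical CG iterations. */
void gram_matrix(const double *V, int nlocal, int m, double *G, MPI_Comm comm)
{
    double *Gloc = calloc((size_t)m * m, sizeof *Gloc);
    for (int i = 0; i < nlocal; i++)          /* Gloc = Vloc^T * Vloc */
        for (int a = 0; a < m; a++)
            for (int b = a; b < m; b++)       /* symmetric: upper triangle */
                Gloc[a * m + b] += V[i * m + a] * V[i * m + b];
    for (int a = 0; a < m; a++)               /* mirror to lower triangle */
        for (int b = a + 1; b < m; b++)
            Gloc[b * m + a] = Gloc[a * m + b];
    MPI_Allreduce(Gloc, G, m * m, MPI_DOUBLE, MPI_SUM, comm); /* the one sync */
    free(Gloc);
}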

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                              96

(Figure: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG converges more slowly due to roundoff and loses accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.)

                                                                                                                              97
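The breakdown at s = 16 is a property of the basis rather than of the s-step reorganization: the monomial basis behaves like the power method, so its columns align and its condition number grows exponentially with s. The standard remedies (a hedged summary, consistent with the "stability challenges" bullet) replace it with a Newton or Chebyshev basis:

\[
V_s = \bigl[\,p,\ Ap,\ A^2p,\ \dots,\ A^sp\,\bigr]
\quad\longrightarrow\quad
v_{j+1} = (A - \theta_j I)\,v_j \quad \text{(Newton basis, shifts } \theta_j \text{ chosen from Ritz values)},
\]

with the Chebyshev variant built from the scaled three-term Chebyshev recurrence; both keep the basis condition number bounded far better than the monomial basis does.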

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices (see the worked expansion after the table below)

Nonzero entries \ Indices | Explicit (O(nnz)) | Implicit (o(nnz))
Explicit (O(nnz))         | CSR and variations | Vision, climate, AMR, …
Implicit (o(nnz))         | Graph Laplacian    | Stencils
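To illustrate the A^k bullet above: one squaring keeps the sparse-plus-low-rank form at the cost of widening the low-rank factors (my grouping; several equivalent ones exist):

\[
A^2 = (S + UDV^T)^2
    = S^2 + \begin{bmatrix} SU & U \end{bmatrix}
            \begin{bmatrix} D & 0 \\ D(V^TU)D & D \end{bmatrix}
            \begin{bmatrix} V^T \\ V^T S \end{bmatrix},
\]

i.e. A^2 = S^2 + U_2 D_2 V_2^T with U_2 = [SU, U] and V_2 = [V, S^T V]. Iterating gives A^k = S^k + (low rank), with the rank of the correction growing by rank(D) at each power, so the representation stays o(n^2) for modest k.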

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                              101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
(Figures: for random vectors the absolute errors have the same magnitude but opposite signs; for orthogonal vectors even the sign of the result is not reproducible.)
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

                                                                                                                              103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.); sketched below

                                                                                                                              104
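A one-bin illustration of the prerounding approach (my sketch of the idea behind Nguyen and Demmel's ReproBLAS, not its actual code; the real algorithm keeps several bins of remainders to preserve accuracy):

#include <math.h>

/* Prerounding sketch: round every summand onto a common power-of-two
   grid so the extracted parts add EXACTLY, hence bitwise identically
   in any order or thread count.  This one-bin version returns only
   the top bin, so it trades accuracy for reproducibility.  Compile
   without -ffast-math so (x + B) - B is not "optimized" away. */
double reproducible_sum(const double *x, long n)
{
    double maxabs = 0.0;
    for (long i = 0; i < n; i++) maxabs = fmax(maxabs, fabs(x[i]));
    if (maxabs == 0.0) return 0.0;
    /* Extractor B: a power of two with B >= n*maxabs, so each extracted
       part is a multiple of ulp(B) and their running sum cannot round. */
    int e = ilogb(maxabs) + (int)ceil(log2((double)n + 1.0)) + 2;
    double B = ldexp(1.0, e);
    double hi = 0.0;
    for (long i = 0; i < n; i++) {
        double q = (x[i] + B) - B;  /* x[i] prerounded to the grid: exact */
        hi += q;                    /* exact sum of grid multiples */
    }
    return hi;  /* identical for any summation order */
}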

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)!

Don't Communic…

106


(Figure residue: Successive Band Reduction (Bischof/Lang/Sun) bulge-chasing diagrams; steps 1–6 with orthogonal updates Q1…Q5 and annotations b+1, d+1, c, d+c. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint c + d ≤ b.)

                                                                                                                                Q5T

                                                                                                                                Q1

                                                                                                                                Q1T

                                                                                                                                Q2

                                                                                                                                Q2T

                                                                                                                                Q3

                                                                                                                                Q3T

                                                                                                                                Q5

                                                                                                                                Q4

                                                                                                                                Q4T

                                                                                                                                b+1

                                                                                                                                b+1

                                                                                                                                d+1

                                                                                                                                d+1

                                                                                                                                c

                                                                                                                                c

                                                                                                                                d+c

                                                                                                                                d+c

                                                                                                                                d+c

                                                                                                                                d+c

                                                                                                                                b = bandwidthc = columnsd = diagonalsConstraint c+d b

                                                                                                                                Successive Band Reduction (BischofLangSun)

Conventional vs. CA-SBR

Conventional:            touch all data 4 times
Communication-Avoiding:  touch all data once
Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        Q^T A Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
      1. Randomized rank-revealing QR decomposition
      2. Randomized location to try splitting the spectrum
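To make the recursion concrete, here is a minimal dense NumPy sketch of one divide step, assuming a real matrix with no eigenvalues too close to the splitting line. It uses an explicit Newton iteration for the matrix sign function; the communication-avoiding version computes the same spectral projector implicitly via QR-based repeated squaring instead of the inverses below. The function name and structure are illustrative, not the actual library interface.

```python
import numpy as np
from scipy.linalg import qr

def split_spectrum(A, shift=0.0, iters=50):
    """One divide step of spectral divide-and-conquer (illustrative sketch).

    Splits the spectrum of A at Re(lambda) = shift; assumes A is real with
    no eigenvalues on (or very near) that vertical line.
    """
    n = A.shape[0]
    # Newton iteration for the matrix sign function of A - shift*I.
    # (The CA algorithm avoids these explicit, communication-heavy inverses.)
    X = A - shift * np.eye(n)
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))
    # Spectral projector onto the invariant subspace for Re(lambda) > shift.
    P = 0.5 * (np.eye(n) + X)
    r = int(round(np.trace(P)))           # dimension of that subspace
    # Randomized rank-revealing QR: pivoted QR of P times a random matrix.
    G = np.random.randn(n, n)
    Q, _, _ = qr(P @ G, pivoting=True)    # leading r columns span range(P)
    T = Q.T @ A @ Q                       # block upper triangular, up to O(eps)
    return Q, T[:r, :r], T[r:, r:]        # recurse on the two diagonal blocks
```

Recursing on the returned A11 and A22 (choosing a new random shift whenever a split fails to separate the spectrum) yields the full block-triangular decomposition.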

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #words | #messages, first for two levels of memory, then for the full memory hierarchy.

BLAS-3:            [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky:          [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite:   [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR:                [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD:    [BDD'11][BDK'13] | [BDD'11]
Nonsym. Eig:       [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; the bounds are #words = Ω(n^2/P^(1/2)) and #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #words (BW) | #messages (L) | saving factor.

BLAS-3:            [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky:          [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU:                [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR:                [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD:    [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Nonsym. Eig:       [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Attaining with extra memory: 2.5D algorithms, M = Θ(c·n^2/P).
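As a quick sanity check on the parenthetical above: plugging M = n^2/P and n^3/P flops per processor into the lower bounds from the start of the talk reproduces the 2D figures. A small sketch (function name illustrative):

```python
import math

def lower_bounds_2d(n, P):
    """Per-processor #words and #messages lower bounds for a 2D layout."""
    M = n * n / P                      # memory per processor
    flops = n ** 3 / P                 # flops per processor (classical O(n^3))
    words = flops / math.sqrt(M)       # = n^2 / sqrt(P)
    messages = flops / M ** 1.5        # = sqrt(P)
    return words, messages

print(lower_bounds_2d(10**4, 100))     # (1e7, 10.0): n^2/sqrt(P) and sqrt(P)
```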

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data, which is optimal (via the matrix powers kernel sketched below)
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages, which is optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
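A minimal sketch of the idea behind the matrix powers kernel, for the 1D Laplacian A = tridiag(-1, 2, -1) with zero boundary values: fetch a halo of depth k once, then compute the locally owned rows of Ax, A^2 x, ..., A^k x with purely local (and partly redundant) work. The function and the slicing that stands in for an MPI halo exchange are illustrative assumptions, not the production kernel.

```python
import numpy as np

def matrix_powers_1d(x, lo, hi, k):
    """Rows lo:hi of A^j x for j = 1..k, where A = tridiag(-1, 2, -1)
    with zero (Dirichlet) boundaries.

    A distributed implementation would receive the k halo entries from
    each neighbor in ONE message; here we slice them out of the global
    vector x just to show the access pattern.
    """
    n = len(x)
    g_lo, g_hi = max(lo - k, 0), min(hi + k, n)   # one-time halo of depth k
    w = np.zeros(g_hi - g_lo + 2)                 # +2: zero padding at the ends
    w[1:-1] = x[g_lo:g_hi]
    out = []
    for _ in range(k):
        y = np.zeros_like(w)
        y[1:-1] = 2 * w[1:-1] - w[:-2] - w[2:]    # local stencil application
        w = y                                      # halo accuracy shrinks by 1 per
        out.append(w[1 + lo - g_lo : 1 + hi - g_lo].copy())  # step; owned rows stay exact
    return out

# Check against explicit matrix powers:
n, k, lo, hi = 20, 3, 5, 12
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.random.rand(n)
for j, rows in enumerate(matrix_powers_1d(x, lo, hi, k), start=1):
    assert np.allclose(rows, (np.linalg.matrix_power(A, j) @ x)[lo:hi])
```

The redundant work is the recomputation inside the shrinking halo; for well-partitioned matrices it is a lower-order cost compared with the k-fold reduction in messages.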

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Figure: Mflops achieved for every register block size; the reference (unblocked) code vs the best block size, 4x2.]

Register Profile: Itanium 2

[Figure: full register-blocking profile on Itanium 2, ranging from 190 Mflops (reference) to 1190 Mflops (best).]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms, best performance as a fraction of machine peak:
  Power3 (17% of peak): 122 to 252 Mflops
  Power4 (16% of peak): 459 to 820 Mflops
  Itanium 1 (8% of peak): 107 to 247 Mflops
  Itanium 2 (33% of peak): 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

Zoom in to top corner: 3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5^2 = 2.25x higher
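SciPy's BSR format implements exactly this trade: it pads each occupied r×c block with explicit zeros so block multiplies can be unrolled. A small sketch of measuring the fill ratio (stored values in blocked format divided by true nonzeros), under the illustrative assumption of a matrix with natural 3x3 substructure:

```python
import numpy as np
import scipy.sparse as sp

# A matrix with genuine 3x3 blocks on a block grid (a stand-in for a
# FEM matrix like ex11), with 30% of entries knocked out so blocking
# must fill in explicit zeros.
rng = np.random.default_rng(0)
nb = 500                                    # 500 x 500 grid of 3x3 cells
mask = sp.random(nb, nb, density=2e-3, random_state=rng, format="coo")
A = sp.kron(mask, np.ones((3, 3))).tocsr()  # n = 1500
A.data[rng.random(A.nnz) < 0.3] = 0.0
A.eliminate_zeros()

B = sp.bsr_matrix(A, blocksize=(3, 3))      # pads partial blocks with zeros
print("fill ratio:", B.nnz / A.nnz)         # stored / true nonzeros, ~1.4 here

x = np.ones(A.shape[1])
assert np.allclose(A @ x, B @ x)            # same operator, different storage
```

Whether the extra flops pay off is exactly what the search has to decide per matrix and machine; on the Pentium III above, 1.5x more stored values still ran 1.5x faster.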

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: sparsity plots of a 100x100 submatrix along the diagonal, before and after RCM reordering.]

Effect of Combined RCM+TSP Reordering

• Before: green + red; after: green + blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface (OSKI)

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration (annotated in the sketch below).
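A plain NumPy rendering of classical CG, with comments marking where a distributed implementation communicates every iteration. This is a generic textbook sketch, not the slide's exact code:

```python
import numpy as np

def cg(A, b, tol=1e-8, maxiter=1000):
    """Classical conjugate gradients for SPD A."""
    x = np.zeros_like(b)
    r = b.copy()                    # residual
    p = r.copy()                    # search direction
    rr = r @ r                      # dot product -> global reduction
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                  # SpMV -> neighbor (halo) communication
        alpha = rr / (p @ Ap)       # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r              # dot product -> global reduction
        if np.sqrt(rr_new) <= tol * bnorm:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

CA-CG restructures this loop so that s iterations' worth of SpMVs become one matrix powers call and the dot products collapse into one reduction, as the next slide describes.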

Example: CA-Conjugate Gradient

• The s SpMVs per outer iteration are performed via the CA matrix powers kernel
• A single global reduction computes the Gram matrix G
• Local computations within the inner loop require no communication

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs CA-CG (monomial basis). CA-CG converges more slowly due to roundoff and loses accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ~ 400
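The breakdown is easy to reproduce: build the same 2D Poisson matrix and watch the condition number of the column-normalized monomial basis [p, Ap, ..., A^s p] blow up as s grows. A short sketch (the exact values depend on the random starting vector):

```python
import numpy as np
import scipy.sparse as sp

m = 30                                        # 30x30 grid, as in the model problem
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))           # normalize columns only
    sv = np.linalg.svd(np.column_stack(V), compute_uv=False)
    print(s, sv[0] / sv[-1])                  # condition number of the basis
# The printed condition number grows geometrically as the columns align
# with the dominant eigenvector, approaching 1/eps near s = 16: the
# monomial basis becomes numerically rank deficient.
```

This is why practical CA-Krylov methods use better-conditioned Newton or Chebyshev bases instead of the monomial basis.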

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:     CSR and variations           Vision, climate, AMR, …
Nonzero entries implicit:     Graph Laplacian              Stencils
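For the sparse-plus-low-rank case, the point is to keep the two parts separate: apply S sparsely and the low-rank part through its factors, never forming the dense sum. A minimal sketch (shapes illustrative):

```python
import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_lowrank(S, U, D, V, x):
    """y = (S + U D V^T) x in O(nnz(S) + n*rank) work, never densifying."""
    return S @ x + U @ (D @ (V.T @ x))

n, r = 1000, 5
rng = np.random.default_rng(1)
S = sp.random(n, n, density=1e-3, format="csr", random_state=rng)
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))
x = rng.standard_normal(n)
y = apply_sparse_plus_lowrank(S, U, D, V, x)
# Powers (S + UDV^T)^k expand into S^k plus low-rank cross terms, which is
# what a CA matrix powers kernel must track for such preconditioners.
```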

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
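The root cause is that floating-point addition is not associative, so a different reduction tree (for example, a different thread count) can change the bits of the result:

```python
# Floating-point addition is not associative: changing the reduction
# order, as a different thread count does, changes the computed bits.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
print((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3))     # 0.6000000000000001 0.6
```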

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.), sketched below
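A minimal one-level sketch of the pre-rounding idea; the real algorithm uses several bins and handles overflow/underflow, while this toy version assumes neither occurs. The trick: round every summand to a common granularity so that each subsequent addition is exact, making the result independent of summation order.

```python
import math

def reproducible_sum(x):
    """Order-independent sum via one-level pre-rounding (toy sketch).

    Every summand is rounded to a multiple of ulp(S); after that, all
    additions below are exact, so ANY summation order gives identical
    bits. Accuracy: absolute error at most ~n * ulp(S) / 2.
    """
    m = max((abs(v) for v in x), default=0.0)
    if m == 0.0:
        return 0.0
    # Pick S = 2^k large enough that n pre-rounded summands cannot carry
    # past the 53-bit significand: n * m < 2^(k-1), +1 bit of headroom.
    k = math.frexp(m)[1] + math.frexp(float(len(x)))[1] + 1
    S = math.ldexp(1.0, k)
    total = 0.0
    for v in x:
        total += (v + S) - S        # v rounded to a multiple of ulp(S); exact add
    return total

# Same bits regardless of order:
import random
data = [random.uniform(-1, 1) * 10 ** random.randint(0, 8) for _ in range(10**5)]
s1 = reproducible_sum(data)
random.shuffle(data)
assert reproducible_sum(data) == s1     # bitwise identical
```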

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M.

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).

                                                                                                                                • Implementing Communication-Avoiding Algorithms
                                                                                                                                • Why avoid communication
                                                                                                                                • Goals
                                                                                                                                • Outline
                                                                                                                                • Outline (2)
                                                                                                                                • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                • Limits to parallel scaling (12)
                                                                                                                                • Limits to parallel scaling (22)
                                                                                                                                • Can we attain these lower bounds
                                                                                                                                • Outline (3)
                                                                                                                                • 25D Matrix Multiplication
                                                                                                                                • 25D Matrix Multiplication (2)
                                                                                                                                • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                • Handling Heterogeneity
                                                                                                                                • Application to Tensor Contractions
                                                                                                                                • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                • Application to Tensor Contractions (2)
                                                                                                                                • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                • vs
                                                                                                                                • Slide 26
                                                                                                                                • Strassen-like beyond matmul
                                                                                                                                • Cache and Network Oblivious Algorithms
                                                                                                                                • CARMA Performance Distributed Memory
                                                                                                                                • CARMA Performance Distributed Memory (2)
                                                                                                                                • CARMA Performance Shared Memory
                                                                                                                                • CARMA Performance Shared Memory (2)
                                                                                                                                • Why is CARMA Faster in Shared Memory
                                                                                                                                • Outline (4)
                                                                                                                                • One-sided Factorizations (LU QR) so far
                                                                                                                                • TSQR An Architecture-Dependent Algorithm
                                                                                                                                • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                • Minimizing Communication in TSLU
                                                                                                                                • Making TSLU Numerically Stable
                                                                                                                                • Stability of LU using TSLU CALU
                                                                                                                                • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                • Fixing TSLU
                                                                                                                                • 2D CALU with Tournament Pivoting
                                                                                                                                • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                • 25D vs 2D LU With and Without Pivoting
                                                                                                                                • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                • Outline (5)
                                                                                                                                • What about sparse matrices (13)
                                                                                                                                • Performance of 25D APSP using Kleene
                                                                                                                                • What about sparse matrices (23)
                                                                                                                                • What about sparse matrices (33)
                                                                                                                                • Outline (6)
                                                                                                                                • Symmetric Eigenproblem and SVD
                                                                                                                                • Slide 58
                                                                                                                                • Slide 59
                                                                                                                                • Slide 60
                                                                                                                                • Slide 61
                                                                                                                                • Slide 62
                                                                                                                                • Slide 63
                                                                                                                                • Slide 64
                                                                                                                                • Slide 65
                                                                                                                                • Slide 66
                                                                                                                                • Slide 67
                                                                                                                                • Slide 68
                                                                                                                                • Conventional vs CA - SBR
                                                                                                                                • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                • Nonsymmetric Eigenproblem
                                                                                                                                • Attaining the Lower bounds Sequential
• Attaining the Lower bounds: Parallel 2D, M=Θ(n²/P) (Ignoring poly-log(P) factors)
                                                                                                                                • Outline (7)
                                                                                                                                • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                • Outline (8)
                                                                                                                                • Example The Difficulty of Tuning SpMV
                                                                                                                                • Example The Difficulty of Tuning
                                                                                                                                • Speedups on Itanium 2 The Need for Search
                                                                                                                                • Register Profile Itanium 2
                                                                                                                                • Register Profiles IBM and Intel IA-64
                                                                                                                                • Another example of tuning challenges for SpMV
                                                                                                                                • Zoom in to top corner
                                                                                                                                • 3x3 blocks look natural buthellip
                                                                                                                                • Extra Work Can Improve Efficiency
                                                                                                                                • Slide 86
                                                                                                                                • Slide 87
                                                                                                                                • Slide 88
                                                                                                                                • Slide 89
                                                                                                                                • Summary of Other Performance Optimizations
                                                                                                                                • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                • Outline (9)
                                                                                                                                • Example Classical Conjugate Gradient (CG)
                                                                                                                                • Example CA-Conjugate Gradient
                                                                                                                                • Outline (10)
                                                                                                                                • Slide 96
                                                                                                                                • Slide 97
                                                                                                                                • Outline (11)
• What is a "sparse matrix"?
                                                                                                                                • Outline (12)
                                                                                                                                • Reproducible Floating Point Computation
                                                                                                                                • Intel MKL non-reproducibility
                                                                                                                                • GoalsApproaches for Reproducibility
• Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown
                                                                                                                                • Collaborators and Supporters
                                                                                                                                • Summary

[Figure: Successive Band Reduction (Bischof/Lang/Sun). Two animation frames showing bulge-chasing steps 1-6, each step an orthogonal update Qi applied from the left (QiT) and right (Qi). Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA - SBR

[Two animations: Conventional (touch all data 4 times) vs Communication-Avoiding (touch all data once)]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
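To make the invariant-subspace claim concrete, here is a small numpy sketch (our illustration, not the randomized algorithm itself): it takes Q1 whose columns span an invariant subspace of A, completes it to an orthogonal Q with an arbitrary complement, and checks that QᵀAQ comes out block upper triangular. Using scipy's Schur factorization to produce the subspace is an assumption of the demo; the talk's algorithm finds it via randomized rank-revealing QR.

    import numpy as np
    from scipy.linalg import schur

    rng = np.random.default_rng(0)
    n, k = 8, 3
    A = rng.standard_normal((n, n))

    # Leading k Schur vectors span an invariant subspace of A
    # (stand-in for the randomized rank-revealing QR of the talk).
    _, Qs = schur(A, output='real')
    Q1 = Qs[:, :k]

    # Complete Q1 to a full orthogonal Q with an arbitrary complement.
    G = rng.standard_normal((n, n - k))
    Q2, _ = np.linalg.qr(G - Q1 @ (Q1.T @ G))
    Q = np.hstack([Q1, Q2])

    B = Q.T @ A @ Q
    print(np.linalg.norm(B[k:, :k]))  # ~1e-15: the (2,1) block vanishes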

Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (#Words, #Messages) | Memory Hierarchy (#Words, #Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00] [LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97] [GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97] [BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03] [DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M=Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: #Words (BW) | #Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M=Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the matrix-powers sketch below)
  – Parallel implementation on p processors:
    • Conventional: O(k·log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                                                  75
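The O(1)-data-movement claim is easiest to see for a stencil-like matrix. Below is a minimal matrix-powers-kernel sketch (our construction, for A = tridiag(-1, 2, -1), the 1D Poisson matrix): each block of rows fetches a ghost region of depth k once, then computes x, Ax, ..., A^k x with no further data exchange, paying a few redundant flops in the ghost zone instead of k round trips. Function and variable names are ours; real kernels handle general sparsity via graph partitioning.

    import numpy as np

    def ca_matrix_powers_1d(x, k, p):
        """V[j] = A^j x for A = tridiag(-1, 2, -1), computed in p blocks
        that each fetch a depth-k ghost region ONCE (no communication
        inside the k-step loop)."""
        n = x.size
        V = np.zeros((k + 1, n))
        V[0] = x
        bounds = np.linspace(0, n, p + 1).astype(int)
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            g_lo, g_hi = max(lo - k, 0), min(hi + k, n)  # one-time ghost fetch
            w = x[g_lo:g_hi].copy()
            for j in range(1, k + 1):
                v = 2.0 * w
                v[:-1] -= w[1:]    # -x[i+1]
                v[1:] -= w[:-1]    # -x[i-1]
                w = v
                # after j steps, entries at distance >= j from an interior
                # ghost edge are still exact, which always covers [lo, hi)
                V[j, lo:hi] = w[lo - g_lo : hi - g_lo]
        return V

    # check against k conventional SpMVs
    n, k, p = 1000, 4, 8
    spmv = lambda u: 2 * u - np.r_[u[1:], 0.0] - np.r_[0.0, u[:-1]]
    x = np.random.default_rng(1).standard_normal(n)
    V, ref = ca_matrix_powers_1d(x, k, p), x.copy()
    for j in range(1, k + 1):
        ref = spmv(ref)
        assert np.allclose(V[j], ref)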

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                                  77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                                                  78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile; Reference performance vs Best (4x2) block size, both in Mflops]

                                                                                                                                  79

Register Profile: Itanium 2

[Figure: SpMV performance across all register block sizes, from 190 Mflops (worst) to 1190 Mflops (best)]

                                                                                                                                  80

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms (worst to best, best as % of peak):
  Power3: 122 to 252 Mflops (17%)
  Power4: 459 to 820 Mflops (16%)
  Itanium 1: 107 to 247 Mflops (8%)
  Itanium 2: 190 Mflops to 1.2 Gflops (33%)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

                                                                                                                                  82

                                                                                                                                  Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

                                                                                                                                  83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                                                  84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher

                                                                                                                                  85
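The trade-off is easy to measure with scipy (a toy sketch of ours; OSKI's heuristic combines this with per-block benchmark data): converting to r-by-c blocked (BSR) storage stores explicit zeros, and the resulting "fill ratio" is the flop price paid for the faster unrolled block multiplies.

    import numpy as np
    import scipy.sparse as sp

    def fill_ratio(A, r, c):
        """Stored entries (true nonzeros + explicit zeros) in r-by-c
        blocked storage, divided by the true nonzero count."""
        B = sp.bsr_matrix(A, blocksize=(r, c))
        return B.data.size / A.nnz

    A = sp.random(960, 960, density=0.01, format='csr',
                  random_state=np.random.default_rng(0))
    for r, c in [(1, 1), (2, 2), (3, 3), (4, 2), (8, 8)]:
        print((r, c), round(fill_ratio(A, r, c), 2))
    # a block size pays off only if its dense-block speedup exceeds its
    # fill ratio (e.g., 1.5x extra flops for a 2.25x higher flop rate)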

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                                  86

                                                                                                                                  100x100 Submatrix Along Diagonal

87

                                                                                                                                  Post-RCM Reordering

                                                                                                                                  88

Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                                                                  90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV, Ax & Aᵀy, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                                  91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                  93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure: SpMVs and dot products require communication in each iteration]

                                                                                                                                  94
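For reference, a textbook CG in Python with comments flagging where a distributed run would communicate (the sketch itself is serial; names are ours):

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-8, maxiter=1000):
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rs = r @ r                     # dot product -> global reduction
        for _ in range(maxiter):
            Ap = A @ p                 # SpMV -> neighbor communication
            alpha = rs / (p @ Ap)      # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r             # dot product -> global reduction
            if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # 1D Poisson sanity check
    n = 100
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
    b = np.ones(n)
    print(np.linalg.norm(A @ cg(A, b) - b))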

Example: CA-Conjugate Gradient

[Algorithm figure: the s-step basis is computed via the CA Matrix Powers Kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication]
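Why one reduction suffices for s steps: once the matrix powers kernel has built the Krylov basis V, every inner product the method needs can be read off the Gram matrix G = VᵀV and short coefficient vectors. A numpy identity check (illustrative only; the full CA-CG coefficient recurrences are not shown):

    import numpy as np

    rng = np.random.default_rng(0)
    n, s = 1000, 4
    V = rng.standard_normal((n, 2 * s + 1))  # stand-in for the Krylov basis
    G = V.T @ V                              # ONE global reduction per s steps

    # vectors carried in basis coordinates, e.g. p = V a and r = V b:
    a = rng.standard_normal(2 * s + 1)
    b = rng.standard_normal(2 * s + 1)
    # their inner product needs only G and short local vectors:
    assert np.isclose((V @ a) @ (V @ b), a @ (G @ b))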

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                  96

[Convergence plot: CA-CG (monomial basis) vs CG. Slower convergence due to roundoff; loss of accuracy due to roundoff. At s = 16, the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. Dashed line: machine precision.]

                                                                                                                                  97
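The breakdown is easy to reproduce: the monomial basis [x, Ax, …, A^s x] has condition number growing exponentially in s, so well before s = 16 it exhausts double precision. A small check (our code, matching the slide's model problem):

    import numpy as np
    import scipy.sparse as sp

    # 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400)
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

    rng = np.random.default_rng(0)
    v = rng.standard_normal(m * m)
    cols = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ cols[-1]
        cols.append(w / np.linalg.norm(w))  # next (normalized) monomial vector
        if s % 4 == 0:
            kappa = np.linalg.cond(np.column_stack(cols))
            print(f"s = {s:2d}: cond(basis) = {kappa:.1e}")
    # cond(basis) climbs toward 1/eps ~ 1e16: numerically rank deficient,
    # matching the s = 16 breakdown in the plot above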

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices

                            Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)) CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)) Graph Laplacian             Stencils
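For the sparse-plus-low-rank case, the point is to apply A, and its powers, without ever densifying. A minimal sketch (names ours): y = (S + U D Vᵀ)x costs one SpMV plus O(n·rank) dense work per application.

    import numpy as np
    import scipy.sparse as sp

    def apply_A(S, U, D, V, x):
        """y = (S + U D V^T) x without forming the dense n-by-n sum."""
        return S @ x + U @ (D @ (V.T @ x))

    rng = np.random.default_rng(0)
    n, r = 2000, 5
    S = sp.random(n, n, density=0.001, format='csr', random_state=rng)
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))

    # A^k x: apply k times; each step is O(nnz(S) + n*r), not O(n^2)
    x = rng.standard_normal(n)
    for _ in range(3):
        x = apply_A(S, U, D, V, x)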

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                  101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

                                                                                                                                  Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: Absolute Error for Random Vectors (same magnitude, opposite signs) and Relative Error for Orthogonal Vectors; even the sign is not reproducible]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

                                                                                                                                  103
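The root cause is just non-associativity of floating-point addition: different thread counts induce different reduction trees, and reassociation changes the rounding. A short illustration with numpy (our example, not MKL itself):

    import numpy as np

    x = np.random.default_rng(1).standard_normal(10**6)
    s_fwd = np.sum(x)          # one summation order
    s_rev = np.sum(x[::-1])    # same data, different association
    print(s_fwd == s_rev, s_fwd - s_rev)  # typically False, difference ~1e-13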

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

                                                                                                                                  104
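A minimal sketch of the prerounding idea (our simplification of the Nguyen/Demmel ReproBLAS approach; assumes IEEE 754 doubles, round-to-nearest, and no overflow): each fold rounds all summands onto a common power-of-two grid, on which floating-point addition is exact and therefore independent of summation order.

    import math
    import numpy as np

    def reproducible_sum(x, folds=3):
        x = np.asarray(x, dtype=np.float64).copy()
        n = x.size
        total = 0.0
        for _ in range(folds):
            m = float(np.max(np.abs(x)))
            if m == 0.0:
                break
            # boundary sigma: every high part lands on a common grid and
            # all partial sums stay below sigma, so they accumulate
            # EXACTLY -- in any order
            sigma = 2.0 ** (math.ceil(math.log2(m)) +
                            math.ceil(math.log2(n + 2)))
            q = (sigma + x) - sigma    # x[i] pre-rounded to the grid
            x -= q                     # exact remainder (Dekker-style split)
            total += np.sum(q)         # exact, order-independent
        return total                   # leftover remainders are dropped, so
                                       # folds controls accuracy, not reproducibility

    v = np.random.default_rng(0).standard_normal(10**6)
    assert reproducible_sum(v) == reproducible_sum(v[::-1])  # bitwise equal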

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                                  Summary

Don't Communic…

                                                                                                                                  106

Time to redesign all linear algebra, n-body, … algorithms and software

                                                                                                                                  (and compilers)


[Figure: Successive Band Reduction bulge chasing – numbered sweeps 1–6, applying orthogonal updates Q1, Q1T, ..., Q5, Q5T to blocks of width b+1, d+1, c, and d+c]
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Successive Band Reduction (Bischof/Lang/Sun)

Conventional vs CA - SBR

Conventional           | Communication-Avoiding
Touch all data 4 times | Touch all data once


Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QTAQ will be block upper triangular:
        QTAQ = [ A11  A12 ]
               [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
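A hedged NumPy sketch of one divide step follows. It substitutes the classical Newton iteration for the matrix sign function and SciPy's column-pivoted QR where the algorithm above calls for QR-based implicit repeated squaring and randomized RRQR (see [BDD'11]); names, shifts, and iteration counts are illustrative only.

import numpy as np
from scipy.linalg import qr

def spectral_split(A, shift=0.0, iters=50):
    # One divide step: Q's leading columns span the invariant subspace for
    # eigenvalues with Re(lambda) > shift, so Q^T A Q is block upper
    # triangular up to a small (2,1) block of size eps.
    n = A.shape[0]
    X = A - shift * np.eye(n)
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))   # Newton iteration -> sign(A - shift*I)
    P = 0.5 * (np.eye(n) + X)              # spectral projector onto that subspace
    Q, _, _ = qr(P, pivoting=True)         # rank-revealing QR ([BDD'11]: randomized)
    B = Q.T @ A @ Q                        # [[A11, A12], [eps, A22]]
    k = int(round(np.trace(P)))            # #eigenvalues to the right of the shift
    return Q, B, k

One would then recurse on B[:k, :k] and B[k:, k:].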

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]

Algorithm         | Two Levels: #Words / #Messages                                 | Memory Hierarchy: #Words / #Messages
BLAS-3            | [FLPR'99][BDLST'13][MKL etc.] (both)                           | [FLPR'99][BDLST'13][MKL etc.] (both)
Cholesky          | [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09]        | [G'97][AP'00][BDHS'09] (both)
Sym. Indefinite   | [BBDDDPSTY'13] (both)                                          | [BBDDDPSTY'13] (both)
LU                | [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13]            | [G'97][T'97][BDLST'13] / [BDLST'13]
QR                | [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
Rank-Revealing QR | [BDD'11][DGGX'13]                                              | [BDD'11][DGGX'13]
Sym. Eig & SVD    | [BDD'11][BDK'13]                                               | [BDD'11]
Non-Sym. Eig      | [BDD'11]                                                       | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]

Algorithm         | #Words (BW)                      | #Messages (L)         | Saving factor
BLAS-3            | [AGZ'94][MT'99][ScaLAPACK]       | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky          | [ScaLAPACK][T'99][SD'11]         | [T'99][SD'11]         | L: n/P^(1/2)
Sym. Indefinite   | [BBDDDPSTY'13][ScaLAPACK]        | [BBDDDPSTY'13]        | L: n/P^(1/2)
LU                | [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR                | [ScaLAPACK][DGHL'12][T'99]       | [DGHL'12][T'99]       | L: n/P^(1/2)
Rank-Revealing QR | [BDD'11][DGGX'13]                | [BDD'11][DGGX'13]     |
Sym. Eig & SVD    | [BDD'11][BDK'13][ScaLAPACK]      | [BDD'11][BDK'13]      | L: n/P^(1/2)
Non-Sym. Eig      | [BDD'11]                         | [BDD'11]              | BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such “Krylov Subspace Methods”
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix “well-partitioned”
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
(see the matrix powers kernel sketch below)

                                                                                                                                    75
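The savings above hinge on a “matrix powers” kernel that produces the whole Krylov basis at once. A minimal sketch follows; this toy version does k ordinary SpMVs, and the communication-avoiding behavior is only described in the comments.

import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    # Computes the Krylov basis [x, Ax, ..., A^k x].
    # Written naively (k separate SpMVs, hence k rounds of communication
    # in parallel); the CA kernel instead fetches each processor's
    # submatrix plus k layers of ghost values once, then produces all
    # k vectors with no further communication.
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

# Example: 1D Poisson (tridiagonal) matrix, k = 3
n = 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
V = matrix_powers(A, np.ones(n), 3)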

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                                    77

Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                                                    78

Speedups on Itanium 2: The Need for Search
[Figure: performance profile, in Mflops – Reference vs Best (4x2) register blocking]

                                                                                                                                    79

Register Profile: Itanium 2
[Figure: Mflops achieved for all register block sizes, from 190 Mflops (reference) up to 1190 Mflops (best)]

                                                                                                                                    80

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles on four platforms – Power3 (best ≈1.7x; 122–252 Mflops), Power4 (≈1.6x; 459–820 Mflops), Itanium 1 (≈8x; 107–247 Mflops), Itanium 2 (≈3.3x; 190 Mflops–1.2 Gflops)]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

                                                                                                                                    82

                                                                                                                                    Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

                                                                                                                                    83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of “fill-in”

                                                                                                                                    84

                                                                                                                                    Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate 1.5² = 2.25x higher

                                                                                                                                    85
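The trade-off is easy to reproduce with SciPy's block-sparse (BSR) format; the random matrix and the 3x3 block size below are illustrative stand-ins, not the slide's benchmark.

import numpy as np
import scipy.sparse as sp

# Converting CSR to 3x3 blocked (BSR) storage fills in explicit zeros
# (extra flops) in exchange for dense 3x3 block multiplies (fewer index
# loads, unrollable inner kernels).
A = sp.random(90, 90, density=0.05, format="csr", random_state=0)
B = A.tobsr(blocksize=(3, 3))          # zero-fill to full 3x3 blocks

fill_ratio = B.data.size / A.nnz       # stored entries incl. explicit zeros
print(f"fill ratio = {fill_ratio:.2f}")

x = np.ones(90)
assert np.allclose(A @ x, B @ x)       # same matvec, different storage

Blocking pays off when the per-flop speed gain exceeds the fill ratio; on the slide, a fill ratio of 1.5 still yielded a 1.5x time speedup.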

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                                    86

                                                                                                                                    100x100 Submatrix Along Diagonal

87

                                                                                                                                    Post-RCM Reordering

                                                                                                                                    88

                                                                                                                                    Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

                                                                                                                                    Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                                                                                                                    90

                                                                                                                                    Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV, A·x & AT·y, TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For “advanced” users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                                    91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                    93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing: classical CG – the SpMV and the dot products require communication in each iteration]

                                                                                                                                    94
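For reference, a minimal NumPy rendering of the textbook CG loop, with the per-iteration communication points marked in comments (this is the standard method, not code from the slide):

import numpy as np

def cg(A, b, maxiter=1000, tol=1e-10):
    # Classical CG: one SpMV and two dot products (global reductions)
    # per iteration -- exactly the communication the CA variant batches.
    x = np.zeros_like(b, dtype=np.float64)
    r = b.astype(np.float64).copy()
    p = r.copy()
    rtr = r @ r                      # dot product -> global reduction
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV -> neighbor communication
        alpha = rtr / (p @ Ap)       # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rtr_new = r @ r              # dot product -> global reduction
        if np.sqrt(rtr_new) < tol * bnorm:
            break
        p = r + (rtr_new / rtr) * p
        rtr = rtr_new
    return x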

Example: CA-Conjugate Gradient
[Algorithm listing: CA-CG – the k SpMVs become one call to the CA matrix powers kernel, one global reduction computes the Gram matrix G, and local computations within the inner loop require no communication]
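Below is a hedged sketch of the s-step reorganization with the monomial basis, in the style of Carson/Demmel formulations; structure and names are illustrative, and a production version would use the matrix powers kernel, a better-conditioned basis, and a single Allreduce for G.

import numpy as np

def ca_cg(A, b, s=4, maxouter=100, tol=1e-10):
    n = b.size
    x = np.zeros(n)
    r = b.astype(np.float64).copy()   # r = b - A@x with x0 = 0
    p = r.copy()
    rtr = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(maxouter):
        # Basis [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]: plain SpMVs
        # here; the CA version builds this with the matrix powers kernel.
        V = np.empty((n, 2*s + 1))
        V[:, 0] = p
        for j in range(1, s + 1):
            V[:, j] = A @ V[:, j - 1]
        V[:, s + 1] = r
        for j in range(s + 2, 2*s + 1):
            V[:, j] = A @ V[:, j - 1]
        G = V.T @ V                   # ONE global reduction per s steps
        # Shift matrix T encodes A*V[:, j] = V[:, j+1] within each block.
        T = np.zeros((2*s + 1, 2*s + 1))
        for j in list(range(s)) + list(range(s + 1, 2*s)):
            T[j + 1, j] = 1.0
        # Short coefficient vectors of p, r, and the x-update in basis V.
        pc = np.zeros(2*s + 1); pc[0] = 1.0
        rc = np.zeros(2*s + 1); rc[s + 1] = 1.0
        xc = np.zeros(2*s + 1)
        for _ in range(s):            # communication-free inner loop
            alpha = rtr / (pc @ (G @ (T @ pc)))
            xc += alpha * pc
            rc -= alpha * (T @ pc)
            rtr_new = rc @ (G @ rc)
            pc = rc + (rtr_new / rtr) * pc
            rtr = rtr_new
        x += V @ xc                   # map coefficients back to long vectors
        r = V @ rc
        p = V @ pc
        if np.sqrt(abs(rtr)) < tol * bnorm:
            break
    return x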

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                    96

[Figure: convergence of CG vs CA-CG with the monomial basis – slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; machine precision marked]

                                                                                                                                    97
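The breakdown mechanism is easy to observe numerically. A small sketch on the same model problem (assumed construction; the exact condition numbers will vary with the random starting vector):

import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (n = 900, cond(A) ~ 400).
# Watch the monomial Krylov basis [x, Ax, ..., A^s x] lose numerical
# rank as s grows.
m = 30
I = sp.identity(m)
T1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(I, T1) + sp.kron(T1, I)).tocsr()

rng = np.random.default_rng(0)
x = rng.standard_normal(m * m)
for s in (4, 8, 16):
    V = np.empty((m * m, s + 1))
    V[:, 0] = x / np.linalg.norm(x)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)   # normalized, not orthogonalized
    print(s, np.linalg.cond(V))   # grows rapidly; near rank deficiency by s = 16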

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of “sparse” matrices
  – Ex: A = sparse + low rank = S + UDVT, D small & square (see the matvec sketch after the table below)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDVT)^k as sum of S^k and low-rank matrices

Indices \ Nonzero entries | Explicit (O(nnz))  | Implicit (o(nnz))
Explicit (O(nnz))         | CSR and variations | Vision, climate, AMR, …
Implicit (o(nnz))         | Graph Laplacian    | Stencils
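A small sketch of the sparse-plus-low-rank case, with illustrative sizes: apply A = S + UDVT to a vector without ever forming A.

import numpy as np
import scipy.sparse as sp

# Operator stored implicitly as A = S + U D V^T (sparse + low rank).
# Apply A to a vector in O(nnz(S) + n*k) work and data, vs O(n^2)
# if A were formed explicitly.
n, k = 1000, 5
S = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
rng = np.random.default_rng(0)
U = rng.standard_normal((n, k))
Vm = rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))

def apply_A(x):
    # Associate right-to-left so no n-by-n intermediate ever appears.
    return S @ x + U @ (D @ (Vm.T @ x))

y = apply_A(rng.standard_normal(n))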

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS mbH
  – Sought reproducible parallel sparse linear equation solver; demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: "Absolute Error for Random Vectors (same magnitude, opposite signs)" and "Relative Error for Orthogonal Vectors" – even the sign is not reproducible.]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value


Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.) – sketched below
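To show the flavor of the prerounding approach, here is a minimal single-bin sketch (my own illustration, not the Nguyen–Demmel implementation, which uses several bins and a single reduction to recover both accuracy and speed). Each addend is truncated to a multiple of a common power of two, so every partial sum is exact and the result is bitwise identical for any summation order:

    import math

    def prerounded_sum(x, bits=40):
        """Reproducible summation by prerounding (single-bin sketch).
        Valid for n < 2**(53 - bits) addends in IEEE double precision."""
        m = max(abs(v) for v in x)           # one max-reduction
        if m == 0.0:
            return 0.0
        # Common boundary: keep at most `bits` leading bits of each addend
        # relative to 2**e.  Every truncated addend is then a multiple of
        # `ulp`, so floating-point addition commits no rounding error at all.
        e = math.ceil(math.log2(m))
        ulp = 2.0 ** (e - bits)
        s = 0.0
        for v in x:
            s += math.floor(v / ulp) * ulp   # truncation is the only error
        return s

Because every partial sum is exact, any reduction tree, thread count, or data layout produces the same bits; accuracy is user-tunable via `bits` (absolute error at most n·ulp, roughly n·2^(-bits)·max|x_i|).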

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Conventional vs CA-SBR
[Animations of symmetric band reduction, conventional vs communication-avoiding]
Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer (toy sketch below)
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        Q^T A Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
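The recursion is easy to see in a toy sketch (my illustration, with assumed names; the explicit matrix sign-function iteration here is a classical stand-in, while [BDD'11] uses an inverse-free, randomized, QR-based implicit repeated squaring, which is what avoids communication). It assumes no eigenvalues lie on the splitting line; in practice the line is moved to a randomized location, which is the second randomization above:

    import numpy as np
    from scipy.linalg import qr

    def split_spectrum(A, tol=1e-10, iters=50):
        """One divide step: separate eigenvalues with Re > 0 from Re < 0
        via the matrix sign function (Newton: X <- (X + inv(X))/2)."""
        n = A.shape[0]
        X = np.array(A, dtype=float)
        for _ in range(iters):
            X = 0.5 * (X + np.linalg.inv(X))   # X -> sign(A)
        P = 0.5 * (np.eye(n) + X)              # projector onto Re(lambda) > 0 subspace
        Q, R, _ = qr(P, pivoting=True)         # rank-revealing QR of the projector
        k = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))
        B = Q.T @ A @ Q                        # block upper triangular: ||B[k:, :k]|| tiny
        return Q, B, k                         # recurse on B[:k, :k] and B[k:, k:]

The rank-revealing QR on the projector corresponds to randomization 1 above; retrying with a shifted splitting line when eigenvalues sit too close to it corresponds to randomization 2.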

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Entries cite algorithms attaining the #words and #messages lower bounds, for two levels of memory and for a full memory hierarchy.)

• BLAS-3 – [FLPR'99] [BDLST'13] [MKL etc.] (#words and #messages, both models)
• Cholesky – two levels: [G'97] [AP'00] [LAPACK] [BDHS'09]; hierarchy: [G'97] [AP'00] [BDHS'09]
• Sym. Indefinite – [BBDDDPSTY'13]
• LU – two levels: #words [G'97] [T'97] [GDX'11] [BDLST'13], #messages [GDX'11] [BDLST'13]; hierarchy: #words [G'97] [T'97] [BDLST'13], #messages [BDLST'13]
• QR – two levels: #words [EG'98] [FW'03] [DGHL'12] [BDLST'13], #messages [FW'03] [DGHL'12] [BDLST'13]; hierarchy: #words [EG'98] [FW'03] [BDLST'13], #messages [FW'03] [BDLST'13]
• Rank-Revealing QR – [BDD'11] [DGGX'13]
• Sym. Eig & SVD – #words [BDD'11] [BDK'13]; #messages [BDD'11]
• Non-Sym. Eig – [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n^2/P)
(Ignoring poly-log(P) factors; bounds: #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3 – [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving factor L: n/P^(1/2)
• Cholesky – [ScaLAPACK] [T'99] [SD'11]; saving factor L: n/P^(1/2)
• Sym. Indefinite – #words (BW): [BBDDDPSTY'13] [ScaLAPACK]; #messages (L): [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU – #words: [ScaLAPACK] [GDX'11] [T'99] [SD'11]; #messages: [GDX'11] [T'99] [SD'11]; saving factor L: n/P^(1/2)
• QR – #words: [ScaLAPACK] [DGHL'12] [T'99]; #messages: [DGHL'12] [T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR – [BDD'11] [DGGX'13]
• Sym. Eig & SVD – #words: [BDD'11] [BDK'13] [ScaLAPACK]; #messages: [BDD'11] [BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym. Eig – [BDD'11]; saving factors BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = O(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation (see the sketch below)
  – Challenges: poor partitioning, preconditioning, numerical stability
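The core new kernel replaces k separate SpMV exchanges with one. Below is a toy sequential model of that idea for a 1D stencil (tridiagonal) matrix: each block reads its slice of x plus k "ghost" entries per side once, then runs k local SpMVs, recomputing some neighbor work redundantly. Function and variable names are illustrative, not from any CA-Krylov library:

    import numpy as np

    def tridiag_spmv(a, b, c, x):
        """One SpMV y = A x for tridiagonal A (sub-diag a, diag b, super-diag c)."""
        y = b * x
        y[1:] += a * x[:-1]
        y[:-1] += c * x[1:]
        return y

    def ca_matrix_powers(a, b, c, x, k, nblocks=4):
        """Return V with V[t] = A^t x for t = 0..k, computed blockwise.
        Each block fetches k ghost values per side ONCE: one exchange
        per k steps instead of one exchange per step."""
        n = len(x)
        V = np.empty((k + 1, n))
        V[0] = x
        cuts = np.linspace(0, n, nblocks + 1).astype(int)
        for lo, hi in zip(cuts[:-1], cuts[1:]):
            glo, ghi = max(lo - k, 0), min(hi + k, n)   # ghost-extended slice
            xl = x[glo:ghi].copy()
            al, bl, cl = a[glo:ghi - 1], b[glo:ghi], c[glo:ghi - 1]
            for t in range(1, k + 1):
                # After t local steps, entries within t of an artificial cut
                # are wrong, but the owned range [lo, hi) stays exact
                # because the ghost zone is k deep.
                xl = tridiag_spmv(al, bl, cl, xl)
                V[t, lo:hi] = xl[lo - glo : hi - glo]
        return V

    # Check: each V[t] matches t applications of the global SpMV.
    rng = np.random.default_rng(1)
    n, k = 100, 5
    a, c = rng.standard_normal(n - 1), rng.standard_normal(n - 1)
    b, x = rng.standard_normal(n), rng.standard_normal(n)
    V = ca_matrix_powers(a, b, c, x, k)
    y = x
    for t in range(1, k + 1):
        y = tridiag_spmv(a, b, c, y)
        assert np.allclose(V[t], y)

This is exactly the "price: some redundant computation" trade: ghost regions of width k are recomputed by two blocks, which is cheap when k is small relative to the block size.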

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search
[Plot: Mflops achieved for each register block size; the reference (unblocked) code vs the best block size, 4x2.]

Register Profile: Itanium 2
[Heat map of SpMV performance over register block sizes, from 190 Mflops (worst) to 1190 Mflops (best).]

Register Profiles: IBM and Intel IA-64
[Heat maps of SpMV performance over register block sizes, best as fraction of peak:
  Power3 (17%): 122–252 Mflops
  Power4 (16%): 459–820 Mflops
  Itanium 1 (8%): 107–247 Mflops
  Itanium 2 (33%): 190 Mflops – 1.2 Gflops]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies (sketched below)
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher
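To make the trade-off concrete, here is a sketch (my illustration; array names are assumptions, not OSKI's internals) of a 3x3 register-blocked SpMV over a BCSR layout. One column index serves nine values, the 3x3 multiply unrolls into register-resident accumulators, and the explicit zeros filled into partial blocks cost exactly the extra flops counted by the fill ratio:

    import numpy as np

    def bcsr_spmv_3x3(nb_rows, ptr, ind, vals, x):
        """y = A x for BCSR with 3x3 blocks.
        ptr/ind index block rows/columns (CSR-style, one index per block);
        vals has shape (num_blocks, 3, 3), including any filled-in zeros."""
        y = np.zeros(3 * nb_rows)
        for bi in range(nb_rows):
            y0 = y1 = y2 = 0.0            # accumulators stay in registers
            for kb in range(ptr[bi], ptr[bi + 1]):
                bj = ind[kb]
                x0, x1, x2 = x[3 * bj : 3 * bj + 3]
                B = vals[kb]              # 3x3 dense block (may contain fill)
                y0 += B[0, 0] * x0 + B[0, 1] * x1 + B[0, 2] * x2
                y1 += B[1, 0] * x0 + B[1, 1] * x1 + B[1, 2] * x2
                y2 += B[2, 0] * x0 + B[2, 1] * x1 + B[2, 2] * x2
            y[3 * bi : 3 * bi + 3] = (y0, y1, y2)
        return y

Relative to CSR, column-index storage and loads drop by about 9x while flops grow by the fill ratio (1.5 here), which is why the blocked code can win despite doing more work.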

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the matrix]

100x100 Submatrix Along Diagonal
[Figure: spy plot of the submatrix]

Post-RCM Reordering
[Figure: spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering
[Figures: before = green + red, after = green + blue]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Algorithm listing: textbook CG. The SpMV and the dot products require communication in each iteration; a sketch follows.]
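For reference, a plain CG sketch (standard textbook formulation, not copied from the slide's listing) with the per-iteration communication points marked as comments:

    import numpy as np

    def cg(A, b, x, tol=1e-8, maxit=1000):
        """Classical CG.  Per iteration, a parallel run communicates in
        one SpMV (halo exchange) and two dot products (global reductions)."""
        r = b - A @ x                        # SpMV: neighbor communication
        p = r.copy()
        rr = r @ r                           # dot product: global reduction
        bnorm = np.linalg.norm(b)
        for _ in range(maxit):
            Ap = A @ p                       # SpMV: neighbor communication
            alpha = rr / (p @ Ap)            # dot product: global reduction
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r                   # dot product: global reduction
            if np.sqrt(rr_new) <= tol * bnorm:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

CA-CG, next, reorganizes s of these iterations so that all the SpMVs become one matrix powers call and all the dot products become one reduction.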

Example: CA-Conjugate Gradient
[Algorithm listing: s-step CA-CG. The s SpMVs are done via the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Plot: convergence of CG vs CA-CG with the monomial basis, residual norm vs iteration. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG converges more slowly due to roundoff and loses accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]
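The rank deficiency is easy to reproduce (my own check, assuming NumPy/SciPy; not from the deck). Build the model problem, form the normalized monomial basis [x, Ax, A^2 x, …, A^16 x] without orthogonalizing, and look at its condition number:

    import numpy as np
    import scipy.sparse as sp

    # 2D Poisson, 5-point stencil on a 30x30 grid: n = 900, cond(A) ~ 400
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = sp.kronsum(T, T).tocsr()

    rng = np.random.default_rng(0)
    x = rng.standard_normal(A.shape[0])

    s = 16
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = x / np.linalg.norm(x)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)   # scale, but do NOT orthogonalize

    sv = np.linalg.svd(V, compute_uv=False)
    print(f"cond(V) = {sv[0] / sv[-1]:.2e}")  # huge (~1/eps): numerically rank deficient

Each new column leans further toward A's dominant eigenvector, so the basis conditioning grows until, near s = 16, it reaches 1/ε and CA-CG with this basis breaks down; better-conditioned Newton or Chebyshev bases are the standard fix.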


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
  – Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

                             Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:    CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit:    Graph Laplacian             Stencils
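To make the "sparse + low rank" idea concrete, here is a minimal sketch (assuming SciPy; the names S, U, D, V and all sizes are illustrative) of applying A = S + UDV^T to a vector without ever forming the dense A, so each matvec costs O(nnz(S) + nr) instead of O(n^2):

    import numpy as np
    import scipy.sparse as sp

    n, r = 1000, 5                     # dimension, and (small) rank of the correction
    S = sp.random(n, n, density=1e-3, format="csr", random_state=0)  # sparse part
    U = np.random.rand(n, r)
    D = np.random.rand(r, r)           # small & square, as on the slide
    V = np.random.rand(n, r)

    def matvec(x):
        # A @ x = S @ x + U @ (D @ (V^T @ x)); never forms the dense n-by-n A
        return S @ x + U @ (D @ (V.T @ x))

    x = np.random.rand(n)
    y = matvec(x)
    # Check against the explicit dense sum on this small example:
    assert np.allclose(y, (S.toarray() + U @ D @ V.T) @ x)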

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: "Absolute Error for Random Vectors" (same magnitude, opposite signs; sign not reproducible) and "Relative Error for Orthogonal Vectors".]
Vector size 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
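The root cause is that floating-point addition is not associative, so a different reduction order (e.g., a different thread count) can legally change the result bits. A tiny self-contained illustration in plain Python (not MKL):

    # Floating-point addition is not associative: grouping changes the rounding.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0  (a+b is exact, then +c)
    print(a + (b + c))   # 0.0  (b+c rounds back to -1e16, cancelling a)
    # A parallel dot product sums partial results in a thread-dependent order,
    # so identical inputs can produce different IEEE-754 results.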

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
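The prerounding idea fits in a few lines. Below is a minimal single-"bin" sketch of the technique, not the actual Nguyen–Demmel/ReproBLAS algorithm (which uses several bins for accuracy, handles over/underflow, and needs only one extra reduction to find max|x_i|); all names are illustrative:

    import math

    def reproducible_sum(x):
        # Pre-rounding: round every summand to a multiple of a common quantum
        # ulp(M), with M = 2^k derived from max|x_i| and n. All partial sums of
        # the rounded values are then EXACT, so any summation order (or any
        # reduction tree) returns bitwise-identical results.
        n = len(x)
        if n == 0:
            return 0.0
        m = max(abs(v) for v in x)
        if m == 0.0:
            return 0.0
        # +1 is a safety margin so partial sums stay exactly representable.
        k = math.frexp(m)[1] + math.ceil(math.log2(n)) + 1
        M = math.ldexp(1.0, k)
        rounded = [(M + v) - M for v in x]   # rounds v to a multiple of ulp(M)
        s = 0.0
        for r in rounded:                    # any order gives identical bits
            s += r
        return s

    vals = [0.1 * i for i in range(1000)]
    assert reproducible_sum(vals) == reproducible_sum(list(reversed(vals)))

The accuracy lost is roughly n·ulp(M) in absolute terms; the multi-bin version recovers more digits, and the user can choose how many bins (goal 4).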

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M.

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Speedups of Sym. Band Reduction vs. DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q = [A11, A12; ε, A22] will be block upper triangular (ε small)
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum
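To illustrate one split step, here is a simplified sketch. It uses the classical Newton iteration for the matrix sign function and column-pivoted QR as stand-ins for the slide's inverse-free, randomized, communication-avoiding iteration and randomized RRQR; the test matrix and shift are illustrative:

    import numpy as np
    from scipy.linalg import qr

    def spectral_divide(A, shift=0.0, iters=50):
        # Newton iteration for the matrix sign function of A - shift*I
        n = A.shape[0]
        X = A - shift * np.eye(n)
        for _ in range(iters):
            X = 0.5 * (X + np.linalg.inv(X))
        P = 0.5 * (np.eye(n) + X)       # projector onto Re(lambda) > shift
        r = int(round(np.trace(P)))     # dimension of that invariant subspace
        # Column-pivoted QR: leading r columns of Q span range(P)
        Q, _, _ = qr(P, pivoting=True)
        T = Q.T @ A @ Q                 # block upper triangular, up to roundoff
        return Q, T, r                  # recurse on T[:r,:r] and T[r:,r:]

    # Nonsymmetric test matrix with known spectrum {-2,-1,-0.5, 1,2,3}
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((6, 6))
    A = Z @ np.diag([-2.0, -1.0, -0.5, 1.0, 2.0, 3.0]) @ np.linalg.inv(Z)
    Q, T, r = spectral_divide(A, shift=0.0)
    print(r)                            # 3 eigenvalues with Re > 0
    print(np.linalg.norm(T[r:, :r]))    # (2,1) block ~ 0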

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

Algorithm          | Two levels: #words / #messages                                  | Memory hierarchy: #words / #messages
BLAS-3             | [FLPR'99][BDLST'13][MKL etc.]                                   | [FLPR'99][BDLST'13][MKL etc.]
Cholesky           | [G'97][AP'00][LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09]         | [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
Sym. Indefinite    | [BBDDDPSTY'13]                                                  | [BBDDDPSTY'13]
LU                 | [G'97][T'97][GDX'11][BDLST'13] / [GDX'11][BDLST'13]             | [G'97][T'97][BDLST'13] / [BDLST'13]
QR                 | [EG'98][FW'03][DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13]  | [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
Rank-Revealing QR  | [BDD'11][DGGX'13]                                               | –
Sym. Eig & SVD     | [BDD'11][BDK'13] / [BDD'11]                                     | –
Non-Sym. Eig       | [BDD'11] / [BDD'11]                                             | –

Attaining the Lower Bounds: Parallel, 2D: M = Θ(n^2/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]

Algorithm          | #Words (BW)                        | #Messages (L)           | Saving factor
BLAS-3             | [AGZ'94][MT'99][ScaLAPACK]         | [C'69][vGW'97][SD'11]   | L: n/P^(1/2)
Cholesky           | [ScaLAPACK][T'99][SD'11]           | [ScaLAPACK][T'99][SD'11]| L: n/P^(1/2)
Sym. Indefinite    | [BBDDDPSTY'13][ScaLAPACK]          | [BBDDDPSTY'13]          | L: n/P^(1/2)
LU                 | [ScaLAPACK][GDX'11][T'99][SD'11]   | [GDX'11][T'99][SD'11]   | L: n/P^(1/2)
QR                 | [ScaLAPACK][DGHL'12][T'99]         | [DGHL'12][T'99]         | L: n/P^(1/2)
Rank-Revealing QR  | [BDD'11][DGGX'13]                  | –                       | –
Sym. Eig & SVD     | [BDD'11][BDK'13][ScaLAPACK]        | [BDD'11][BDK'13]        | L: n/P^(1/2)
Non-Sym. Eig       | [BDD'11]                           | [BDD'11]                | BW: P^(1/2), L: n

Saving factors are attained with extra memory, 2.5D: M = Θ(c·n^2/P).

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax = b or Ax = λx
  – Does k SpMVs with A and starting vector (see the sketch after this list)
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
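To fix ideas, here is a minimal, non-communication-avoiding version of the kernel these methods reorganize: computing the Krylov basis [x, Ax, A^2x, …, A^kx]. The matrix and k are illustrative; the CA "matrix powers kernel" produces the same k+1 vectors while reading A only O(1) times (serial) or with O(log p) messages (parallel), at the price of redundant overlap work:

    import numpy as np
    import scipy.sparse as sp

    def krylov_basis(A, x, k):
        # Naive version: each of the k SpMVs reads all of A, i.e. O(k)
        # slow-memory traffic, or O(k log p) messages in parallel.
        V = np.empty((A.shape[0], k + 1))
        V[:, 0] = x
        for j in range(k):
            V[:, j + 1] = A @ V[:, j]   # one SpMV = one communication phase
        return V

    # Example: 1D Poisson (tridiagonal) matrix, k = 4
    n, k = 100, 4
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    V = krylov_basis(A, np.ones(n), k)
    print(V.shape)   # (100, 5)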

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search
[Plot: SpMV Mflop/s over the space of register block sizes; the reference (unblocked) code vs. the best blocking found by search, 4x2.]

Register Profile: Itanium 2
[Heat map: SpMV Mflop/s for each register block size, ranging from 190 Mflop/s (worst) to 1190 Mflop/s (best).]

Register Profiles: IBM and Intel IA-64
[Heat maps of SpMV Mflop/s per register block size on four machines: Power3 (122–252 Mflop/s, 17% of peak), Power4 (459–820 Mflop/s, 16%), Itanium 1 (107–247 Mflop/s, 8%), Itanium 2 (190 Mflop/s – 1.2 Gflop/s, 33%).]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher
(A fill-ratio sketch follows below.)
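The fill ratio is easy to measure with an off-the-shelf blocked format. A small sketch, assuming SciPy; the 3x3 block size and the random stand-in matrix are illustrative:

    import scipy.sparse as sp

    # Random sparse matrix standing in for a real mesh matrix
    A = sp.random(900, 900, density=0.01, format="csr", random_state=0)

    # Block CSR with 3x3 blocks: zeros are filled in explicitly, so every
    # stored block is dense and the 3x3 multiply can be unrolled.
    B = sp.bsr_matrix(A, blocksize=(3, 3))

    stored = B.nnz                 # values stored, including explicit zeros
    fill_ratio = stored / A.nnz    # > 1 means extra work and storage
    print(f"fill ratio = {fill_ratio:.2f}")
    # Blocking pays off when the Mflop-rate gain exceeds the fill ratio,
    # e.g. the slide's 3x3 case: rate 2.25x higher / fill 1.5 = 1.5x speedup.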

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Spy plots: a 100x100 submatrix along the diagonal; the same after RCM reordering; and the effect of combined RCM+TSP reordering (before: green + red; after: green + blue). Result: 2x speedups on Pentium 4, Power 4, …]

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Algorithm listing: textbook CG. The SpMV and the dot products require communication in each iteration.]
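For reference, here is textbook CG with the communication-bearing operations marked in comments; the test matrix matches the slides' model problem (2D Poisson, 5-point stencil, 30x30 grid), while the tolerance and iteration cap are illustrative:

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-10, maxiter=1000):
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rr = r @ r                      # dot product -> global reduction
        for _ in range(maxiter):
            Ap = A @ p                  # SpMV -> neighbor communication
            alpha = rr / (p @ Ap)       # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r              # dot product -> global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p   # vector update: no communication
            rr = rr_new
        return x

    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = sp.kronsum(T, T).tocsr()        # 2D Poisson; cond(A) ~ 400
    b = np.ones(m * m)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))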

Example: CA-Conjugate Gradient
[Algorithm listing: s-step CA-CG. The Krylov basis is computed via the CA matrix powers kernel; one global reduction computes the Gram matrix G; the local computations within the inner loop require no communication.]
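The "global reduction to compute G" replaces all the inner products of the next s iterations. A minimal illustration of why, with random data and illustrative sizes: once G = V^T V is known, the dot product of any two vectors represented in the basis V is the local, communication-free quantity a^T G b:

    import numpy as np

    rng = np.random.default_rng(1)
    n, s = 50, 4
    V = rng.standard_normal((n, 2 * s + 1))   # Krylov basis (columns), from
                                              # the matrix powers kernel
    G = V.T @ V                               # ONE reduction for s steps

    a = rng.standard_normal(2 * s + 1)        # coefficients of u = V a
    b = rng.standard_normal(2 * s + 1)        # coefficients of v = V b
    u, v = V @ a, V @ b
    assert np.isclose(u @ v, a @ G @ b)       # dot(u, v) without touching u, v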

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                        96

[Figure: convergence of CA-CG (monomial basis) vs. classical CG. Both slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400; the machine-precision level is marked.]
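The breakdown is easy to reproduce: the columns of the monomial basis [p, Ap, A²p, …] all turn toward the dominant eigenvector, so the basis loses numerical rank as s grows. A small sketch (using a stand-in 1D Laplacian rather than the deck's 2D Poisson problem; the exact breakdown point differs, but the growth of the basis condition number toward 1/ε is the same phenomenon):

```python
import numpy as np

n = 100
# Stand-in test matrix: tridiagonal 1D Laplacian (SPD, like the model problem).
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

rng = np.random.default_rng(0)
V = [rng.standard_normal(n)]

# Build the monomial Krylov basis [p, Ap, ..., A^s p] and watch its
# condition number blow up toward 1/machine-epsilon.
for s in range(1, 17):
    V.append(A @ V[-1])
    sv = np.linalg.svd(np.column_stack(V), compute_uv=False)
    print(f"s = {s:2d}   cond = {sv[0] / sv[-1]:.2e}")
```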

                                                                                                                                        97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

                            Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit    CSR and variations           Vision, climate, AMR, …
Nonzero entries implicit    Graph Laplacian              Stencils
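The S + UDV^T case is the reason the definition matters: you never form the dense sum, you apply each piece. A hedged sketch (illustrative names, not any particular library's API):

```python
import numpy as np
import scipy.sparse as sp

def matvec_sparse_plus_lowrank(S, U, D, V, x):
    # y = (S + U D V^T) x without densifying:
    # one SpMV plus two skinny matvecs and a small k-by-k product.
    return S @ x + U @ (D @ (V.T @ x))

# Repeated application yields (S + U D V^T)^k x, which is what a
# matrix powers kernel must compute while keeping the sparse part S
# and the low-rank part separate.
n, k = 1000, 5
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))
x = rng.standard_normal(n)
for _ in range(3):                  # x <- (S + U D V^T) x
    x = matvec_sparse_plus_lowrank(S, U, D, V, x)
```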

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                        101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

[Plots: absolute error for random vectors (results of the same magnitude but opposite signs) and relative error for orthogonal vectors, where even the sign is not reproducible.]
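The mechanism is plain nonassociativity: changing the thread count changes the reduction order, and floating-point addition is not associative. A single-machine illustration of the same effect (plain Python, not MKL):

```python
import random

random.seed(1)
# Summands of wildly different magnitudes and mixed signs.
x = [random.uniform(-1, 1) * 10.0 ** random.randint(0, 15)
     for _ in range(100_000)]

s_fwd = sum(x)            # one summation order
s_rev = sum(reversed(x))  # a different order
print(s_fwd == s_rev)     # typically False
print(s_fwd - s_rev)      # a nonzero rounding difference
```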

                                                                                                                                        103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.)
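As a toy illustration of the pre-rounding idea (one level only, with an assumed accuracy parameter `bits`; the published scheme is more refined and keeps far more accuracy): round every summand onto a common power-of-two grid so that each addition is exact, making the result bit-identical for any summation order.

```python
import math

def prerounded_sum(x, bits=26):
    """Toy one-level pre-rounded summation (sketch).
    Valid while log2(len(x)) + bits < 53, so every partial sum
    is exactly representable in double precision."""
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    ulp = 2.0 ** (math.ceil(math.log2(m)) - bits)  # common grid spacing
    s = 0.0
    for v in x:
        # Pre-rounded summand: an exact integer multiple of ulp.
        s += round(v / ulp) * ulp
    return s  # same bits for ANY order of summation
```

The accuracy/performance trade-off is explicit: the error is bounded by roughly len(x)·ulp/2 relative to the largest summand, and the user chooses `bits` (goal 4 above).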

                                                                                                                                        104

Performance results on a 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                                        Summary

Don't Communic…

                                                                                                                                        106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)



                                                                                                                                          93

                                                                                                                                          Example Classical Conjugate Gradient (CG)

                                                                                                                                          SpMVs and dot products require communication in

                                                                                                                                          each iteration

                                                                                                                                          via CA Matrix Powers Kernel

                                                                                                                                          Global reduction to compute G

                                                                                                                                          94

                                                                                                                                          Example CA-Conjugate Gradient

                                                                                                                                          Local computations within inner loop require

                                                                                                                                          no communication

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods - Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity


CA-CG (monomial) vs. CG on a model problem

[Figure: convergence histories of CG and CA-CG with the monomial basis; the accuracy floor is marked at machine precision.]
• Slower convergence due to roundoff
• Loss of accuracy due to roundoff
• At s = 16, the monomial basis is rank deficient; the method breaks down
• Model problem:
– 2D Poisson, 5-point stencil
– 30x30 grid
– cond(A) ≈ 400
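The breakdown is straightforward to reproduce: the columns of the monomial basis all converge toward the dominant eigenvector, so the basis loses numerical rank as s grows. A sketch on the slide's model problem:

    import numpy as np

    # Model problem from the slide: 2D Poisson, 5-point stencil, 30x30 grid
    m = 30
    T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))   # n = 900, cond(A) ~ 400

    rng = np.random.default_rng(0)
    v = rng.standard_normal(m * m)
    for s in (4, 8, 16):
        V = np.empty((m * m, s + 1))
        V[:, 0] = v / np.linalg.norm(v)
        for j in range(s):
            w = A @ V[:, j]
            V[:, j + 1] = w / np.linalg.norm(w)  # scale columns, no orthogonalization
        print(s, np.linalg.matrix_rank(V), f"cond = {np.linalg.cond(V):.1e}")
    # cond(V) approaches 1/machine-precision by s = 16: numerically rank deficient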

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods - Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
– Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
– Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices
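For k = 2, for instance, the required splitting is immediate:

    (S + UDV^T)^2 = S^2 + [ S U D V^T + U D V^T S + U (D V^T U D) V^T ]

Every bracketed term has rank at most the dimension of D, so A^2 = S^2 + (low rank); induction extends this to A^k.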

Nonzero entries vs. indices:

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
Entries explicit (O(nnz)):    CSR and variations           Vision, climate, AMR, …
Entries implicit (o(nnz)):    Graph Laplacian              Stencils
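The two extremes of the table in code (a minimal sketch, assuming SciPy): CSR stores entries and indices explicitly, while a stencil stores neither, since the values and neighbor positions are implied by the grid:

    import numpy as np
    import scipy.sparse as sp

    n = 6
    # Entries and indices explicit: CSR stores O(nnz) values and O(nnz) indices
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

    # Entries and indices implicit: the same 1D Laplacian as a stencil rule,
    # y_i = 2 x_i - x_{i-1} - x_{i+1}, storing no matrix data at all
    def stencil_apply(x):
        y = 2.0 * x
        y[1:] -= x[:-1]
        y[:-1] -= x[1:]
        return y

    x = np.arange(n, dtype=float)
    print(np.allclose(A @ x, stencil_apply(x)))   # True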

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods - Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
– From Kai Diethelm, at GNS-MBH
– Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
– Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
– Most: "What?! How will I debug without reproducibility?"
– Few: "I know better, and do careful error analysis"
– S. Govindjee: needs it for fracture simulations
– S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors, across thread counts; even the sign of the result is not reproducible.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value
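The effect is easy to reproduce without MKL: floating-point addition is not associative, so merely changing the reduction order (as a different thread count does) changes the bits of the result. A small sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    s1 = float(np.sum(x))                                      # one summation order
    s2 = sum(float(np.sum(c)) for c in np.array_split(x, 4))   # "4 threads": 4 partial sums
    print(s1 == s2, s1 - s2)   # typically False, off by a few ulps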

Goals/Approaches for Reproducibility
• Consider summation or dot product. Goals:
1. Same answer, independent of layout, number of processors, order of summands
2. Good performance (scales well)
3. Portable (assume IEEE 754 only)
4. User can choose accuracy
• Approaches:
– Guarantee a fixed reduction tree (gives up goal 2 or 3)
– Use (very) high precision to get the exact answer (gives up goal 2)
– Prerounding technique (Nguyen, D.)
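A toy, single-bin version of the prerounding idea follows; the production algorithms (e.g., in ReproBLAS) use several bins to retain accuracy. The sketch assumes IEEE 754 doubles with round-to-nearest:

    import numpy as np

    def prerounded_sum(x):
        # Round every addend to one common grid so all later additions are
        # exact, hence independent of summation order (and of thread count).
        n = len(x)
        m = np.max(np.abs(x))      # one reduction; max is order-independent
        if m == 0.0:
            return 0.0
        # Grid coarse enough that any partial sum of n grid multiples is exact
        S = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 1)
        y = (x + S) - S            # each x_i rounded to a multiple of ulp(S)
        return float(np.sum(y))    # exact in any order

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**6)
    print({prerounded_sum(rng.permutation(x)) for _ in range(5)})  # one unique value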

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                            Optimized Sparse Kernel Interface - OSKI

                                                                                                                                            bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

                                                                                                                                            bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

                                                                                                                                            bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

                                                                                                                                            software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

                                                                                                                                            91

                                                                                                                                            Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                            ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                            ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                            bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                            bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                            93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure. Callout: SpMVs and dot products require communication in each iteration. The CA variant on the next slide replaces them via the CA Matrix Powers Kernel plus one global reduction to compute G.]
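The CG pseudocode itself survives only as an image in this transcript. As a stand-in, here is a minimal NumPy sketch of classical CG (the function name and arguments are illustrative, not the slide's notation), with the communication each iteration would need in a distributed setting marked in comments:

```python
import numpy as np

def cg(A, b, x0, tol=1e-10, maxiter=1000):
    """Classical conjugate gradient: one SpMV and two dot products
    per iteration, so communication cannot be avoided within it."""
    x = x0.copy()
    r = b - A @ x                  # initial residual
    p = r.copy()                   # search direction
    rs = r @ r                     # dot product -> global reduction
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV -> neighbor communication
        alpha = rs / (p @ Ap)      # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product -> global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # update search direction
        rs = rs_new
    return x
```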

                                                                                                                                            94

Example: CA-Conjugate Gradient

[Algorithm figure. Callout: local computations within the inner loop require no communication.]
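To make the callouts concrete, here is a sketch (exact arithmetic only, ignoring the distributed schedule; names are illustrative) of the two objects CA-CG builds per outer loop: the s-step monomial basis, which the CA Matrix Powers Kernel computes with a single round of neighbor communication, and the Gram matrix G, which one global reduction computes in place of the many separate dot products of s classical iterations.

```python
import numpy as np

def s_step_basis(A, p, s):
    """Monomial basis V = [p, A p, A^2 p, ..., A^s p].
    The CA matrix powers kernel produces this with one round of
    neighbor communication (ghost zones) instead of s rounds."""
    V = np.empty((len(p), s + 1))
    V[:, 0] = p
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

A = np.diag(np.arange(1.0, 11.0))      # toy SPD matrix
V = s_step_basis(A, np.ones(10), s=4)
G = V.T @ V   # one Gram matrix = one global reduction; the inner
              # products needed by the next s steps are read from G
```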

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                            96

[Convergence plot, CA-CG (monomial) vs. CG, on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence due to roundoff and loss of accuracy due to roundoff, relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

                                                                                                                                            97
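The breakdown above can be reproduced numerically: build the slide's model problem and watch the condition number of the monomial basis blow up as s grows (a small sketch; the random starting vector is an assumption):

```python
import numpy as np

# Model problem from the slide: 2D Poisson, 5-point stencil, 30x30 grid
m = 30
T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))   # cond(A) ~ 400
rng = np.random.default_rng(0)
p = rng.standard_normal(m * m)
p /= np.linalg.norm(p)

for s in (4, 8, 16):
    V = np.empty((m * m, s + 1))
    V[:, 0] = p
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    # cond(V) grows roughly exponentially in s; by s = 16 the basis is
    # numerically rank deficient in double precision, as the plot shows
    print(s, np.linalg.cond(V))
```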

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices

                             Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:    CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit:    Graph Laplacian             Stencils
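A tiny sketch of the "sparse + low rank" point: with A = S + UDV^T kept in factored form, y = Ax costs one SpMV plus O(nr) work, and A is never formed densely. Sizes and the random data below are hypothetical.

```python
import numpy as np
from scipy import sparse

n, r = 1000, 5                          # r << n
rng = np.random.default_rng(0)
S = sparse.random(n, n, density=0.005, random_state=0, format="csr")
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))

def apply_A(x):
    """y = (S + U D V^T) x without forming the dense n x n matrix."""
    return S @ x + U @ (D @ (V.T @ x))

y = apply_A(rng.standard_normal(n))
```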

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                            101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: "Absolute Error for Random Vectors" (same magnitude, opposite signs) and "Relative Error for Orthogonal Vectors" (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

                                                                                                                                            103
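The root cause is easy to demonstrate: floating-point addition is not associative, so any change in reduction order (thread count, blocking, vectorization) can change the bits. A minimal illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)

s_forward  = sum(x.tolist())          # left-to-right summation
s_reversed = sum(x[::-1].tolist())    # same numbers, opposite order
s_pairwise = float(np.sum(x))         # NumPy's pairwise reduction

print(s_forward == s_reversed)        # typically False
print(s_forward - s_pairwise)         # tiny but nonzero difference
```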

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

                                                                                                                                            104
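A toy sketch of the prerounding idea follows. The real Demmel/Nguyen algorithm uses several bins plus error terms to keep accuracy; this single-bin version, with the illustrative name `reproducible_sum`, only shows why prerounding buys order-independence.

```python
import math
import numpy as np

def reproducible_sum(x, bits=30):
    """Single-bin prerounding sum: bitwise identical for any order.

    Summands are rounded to a common grid derived from the global max,
    so every rounded value is an integer multiple of `ulp` and, as long
    as n * 2**(bits+1) < 2**53, every partial sum below is exact --
    hence the result does not depend on summation order. Accuracy is
    reduced; the real algorithm recovers it with multiple bins.
    """
    x = np.asarray(x, dtype=np.float64)
    m = float(np.max(np.abs(x)))
    if m == 0.0:
        return 0.0
    ulp = 2.0 ** (math.floor(math.log2(m)) - bits)   # grid spacing
    q = np.round(x / ulp)          # integer-valued doubles, |q| < 2**(bits+1)
    return float(np.sum(q)) * ulp  # all additions exact -> order-independent

x = np.random.default_rng(2).standard_normal(10**6)
print(reproducible_sum(x) == reproducible_sum(x[::-1]))   # True
```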

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                                            Summary

Don't Communic…

                                                                                                                                            106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Attaining the Lower bounds: Parallel 2D, M = Ω(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)

Legend: [Existing] [Ours] [Math-Lib] [Random]

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym Indefinite: Words [BBDDDPSTY'13][ScaLAPACK], Messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: Words [ScaLAPACK][GDX'11][T'99][SD'11], Messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: Words [ScaLAPACK][DGHL'12][T'99], Messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: Words [BDD'11][BDK'13][ScaLAPACK], Messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym Eig: Words [BDD'11], Messages [BDD'11]; saving factor BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Ω(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                              Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication (see the sketch after this list)
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                                                              75
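A sketch of the trick for the simplest case, a 1D 3-point stencil standing in for SpMV with a tridiagonal A: fetch k ghost values per side once, then apply the operator k times locally, trading redundant flops for k-1 message rounds. Function and variable names are illustrative.

```python
import numpy as np

def k_steps_with_ghosts(x_local, ghost_left, ghost_right, k):
    """Compute the local part of A^k x with one communication step,
    where A is the 3-point stencil y_i = x_{i-1} - 2 x_i + x_{i+1}.
    The real matrix powers kernel also keeps the intermediate vectors
    A x, ..., A^(k-1) x; this sketch returns only A^k x."""
    assert len(ghost_left) == len(ghost_right) == k
    v = np.concatenate([ghost_left, x_local, ghost_right])  # one exchange
    for _ in range(k):
        v = v[:-2] - 2 * v[1:-1] + v[2:]   # valid region shrinks by 1/side
    return v                               # length == len(x_local)

k = 3
x_local = np.arange(10.0)
zeros = np.zeros(k)                        # boundary processor: zero ghosts
y = k_steps_with_ghosts(x_local, zeros, zeros, k)
```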

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                                              77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

                                                                                                                                              78
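For reference, the unblocked CSR kernel these slides are tuning, in a plain Python rendering (not the tuned code): the indirect access x[colind[k]] is the memory-reference stream that the 8x8 blocking limits.

```python
import numpy as np

def spmv_csr(n, rowptr, colind, val, x):
    """Reference CSR SpMV, y = A x. One index load per nonzero and an
    irregular gather from x are what register blocking amortizes."""
    y = np.zeros(n)
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += val[k] * x[colind[k]]
    return y
```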

Speedups on Itanium 2: The Need for Search

[Heatmaps of SpMV Mflops over register block sizes: "Reference" (unblocked) vs. "Best: 4x2".]

                                                                                                                                              79

Register Profile: Itanium 2

[Heatmap of SpMV Mflops over register block sizes, from 190 Mflops to 1190 Mflops.]

                                                                                                                                              80

Register Profiles: IBM and Intel IA-64

[Four heatmap panels of SpMV Mflops over register block sizes, with best speedups: Power3 – 1.7 (122 to 252 Mflops), Power4 – 1.6 (459 to 820 Mflops), Itanium 1 – 1.8 (107 to 247 Mflops), Itanium 2 – 3.3 (190 Mflops to 1.2 Gflops).]

                                                                                                                                              Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                                                              82

                                                                                                                                              Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

                                                                                                                                              83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                                                              84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill-in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher

                                                                                                                                              85
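The fill-ratio trade-off is easy to measure with SciPy's Block CSR format (a hypothetical random matrix stands in for the slide's FEM matrix, so the printed ratio will differ from 1.5):

```python
import numpy as np
from scipy import sparse

A = sparse.random(300, 300, density=0.01, random_state=0, format="csr")

# Block CSR stores one index per r x c block and pads partially filled
# blocks with explicit zeros -- the "fill-in" of the previous slide.
A_bsr = A.tobsr(blocksize=(3, 3))

fill_ratio = A_bsr.nnz / A.nnz   # stored entries (incl. zeros) / true nnz
print("fill ratio:", fill_ratio)

# Same SpMV result; the blocked kernel does fill_ratio times more flops
# but with unrolled 3x3 multiplies and far fewer index loads.
x = np.ones(300)
assert np.allclose(A @ x, A_bsr @ x)
```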

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                                              86

                                                                                                                                              100x100 Submatrix Along Diagonal

87

                                                                                                                                              Post-RCM Reordering

                                                                                                                                              88

                                                                                                                                              Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …
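The reordering step is available off the shelf: SciPy's reverse Cuthill-McKee recovers banded structure that a bad ordering hides (a scrambled 1D Laplacian is a toy stand-in for the cavity matrix):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import reverse_cuthill_mckee

n = 200
L = sparse.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
p = np.random.default_rng(0).permutation(n)
A = L[p][:, p]                     # scramble: hide the banded structure

perm = reverse_cuthill_mckee(A.tocsr(), symmetric_mode=True)
A_rcm = A[perm][:, perm]           # symmetric permutation pulls nonzeros
                                   # back toward the diagonal

def bandwidth(M):
    coo = M.tocoo()
    return int(np.max(np.abs(coo.row - coo.col)))

print(bandwidth(A), "->", bandwidth(A_rcm))   # large -> small (near 1)
```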

                                                                                                                                              Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                                                                                                                              90

                                                                                                                                              Optimized Sparse Kernel Interface - OSKI

                                                                                                                                              bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

                                                                                                                                              bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

                                                                                                                                              bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

                                                                                                                                              software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

                                                                                                                                              91

                                                                                                                                              Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                              ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                              ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                              bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                              bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                              93

                                                                                                                                              Example Classical Conjugate Gradient (CG)

                                                                                                                                              SpMVs and dot products require communication in

                                                                                                                                              each iteration

                                                                                                                                              via CA Matrix Powers Kernel

                                                                                                                                              Global reduction to compute G

                                                                                                                                              94

                                                                                                                                              Example CA-Conjugate Gradient

                                                                                                                                              Local computations within inner loop require

                                                                                                                                              no communication

                                                                                                                                              Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                              ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                              ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                              bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                              bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                              96

[Figure: convergence of CG vs. CA-CG with the monomial basis, plotted relative to machine precision. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400
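The breakdown is easy to reproduce. Below is a small NumPy experiment consistent with the slide's model problem (an illustrative reconstruction, not the code behind the plot): it builds the 2D Poisson matrix, forms the normalized monomial basis v, Av, A²v, …, and watches the basis condition number explode.

    import numpy as np

    def poisson_2d(n):
        # 2D Poisson, 5-point stencil, on an n x n grid (Kronecker sum form)
        T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        return np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

    A = poisson_2d(30)                    # cond(A) ~ 400, as on the slide
    v = np.random.default_rng(0).standard_normal(A.shape[0])

    for s in (4, 8, 16):
        V = np.empty((A.shape[0], s + 1))
        V[:, 0] = v / np.linalg.norm(v)
        for j in range(s):                # monomial basis: v, Av, A^2 v, ...
            w = A @ V[:, j]
            V[:, j + 1] = w / np.linalg.norm(w)
        print(f"s = {s:2d}: cond(basis) = {np.linalg.cond(V):.2e}")
    # The condition number grows exponentially with s; on the slide's model
    # problem the monomial basis is numerically rank deficient by s = 16.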

                                                                                                                                              97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

                       Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
Entries: Explicit      CSR and variations           Vision, climate, AMR, …
Entries: Implicit      Graph Laplacian              Stencils
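As a toy illustration of the explicit/implicit distinction (a sketch using an assumed 1D Laplacian; not tied to any particular library beyond SciPy's CSR format):

    import numpy as np
    from scipy.sparse import diags

    n = 30
    # Explicit entries + explicit indices: CSR storage of the 1D Laplacian.
    A_csr = diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n)).tocsr()

    def laplace_apply(x):
        # Implicit entries + implicit indices: the same operator applied
        # as a stencil. O(1) metadata instead of O(nnz) values + indices.
        y = 2 * x
        y[:-1] -= x[1:]
        y[1:] -= x[:-1]
        return y

    x = np.random.default_rng(1).standard_normal(n)
    assert np.allclose(A_csr @ x, laplace_apply(x))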

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                              101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors – same magnitude, opposite signs; relative error for orthogonal vectors – even the sign is not reproducible.]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
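The effect is easy to mimic without MKL. The sketch below models a threaded dot product by splitting the data into per-"thread" chunks and combining partial sums, then applies the slide's error definitions; chunked_dot is a hypothetical stand-in for a threaded BLAS call, not MKL itself.

    import numpy as np

    def chunked_dot(x, y, nthreads):
        # Each "thread" reduces its chunk, then partials are combined.
        # The answer depends on nthreads because floating-point addition
        # is not associative.
        parts = zip(np.array_split(x, nthreads), np.array_split(y, nthreads))
        return sum(float(xi @ yi) for xi, yi in parts)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6) * 10.0**rng.integers(-6, 6, 10**6)
    y = rng.standard_normal(10**6) * 10.0**rng.integers(-6, 6, 10**6)

    results = [chunked_dot(x, y, t) for t in (1, 2, 3, 4)]
    abs_err = max(results) - min(results)
    rel_err = abs_err / max(abs(r) for r in results)
    print(abs_err, rel_err)   # typically nonzero: not bitwise reproducible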

                                                                                                                                              103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)
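A minimal sketch of the first approach, a fixed reduction tree. The tree shape is assumed to be fixed by the global data layout, so the result is a function of the data only, independent of how many processors execute it; this is not the prerounding technique.

    import numpy as np

    def tree_sum(v):
        # Fixed binary reduction tree over the global index space.
        v = [float(t) for t in v]
        while len(v) > 1:
            v = [v[i] + v[i+1] for i in range(0, len(v) - 1, 2)] + \
                ([v[-1]] if len(v) % 2 else [])
        return v[0]

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**5)
    # A p-processor version that assigns each processor an aligned subtree
    # reproduces this value bit for bit, independent of p. Contrast with
    # plain left-to-right sums, which change when the order changes:
    print(tree_sum(x), sum(x) == sum(x[::-1]))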

                                                                                                                                              104

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal (see the matrix powers sketch below)
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

                                                                                                                                                75
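A toy version of the matrix powers kernel for a 1D 3-point stencil, assuming a block row partition; real implementations handle general sparsity patterns and trade a little redundant computation for far fewer messages.

    import numpy as np

    def local_matrix_powers(x, lo, hi, k):
        # Compute entries lo:hi of A x, A^2 x, ..., A^k x for the 1D
        # 3-point Laplacian, given only x[lo-k : hi+k]. One exchange of
        # k ghost values per neighbor replaces the k halo exchanges that
        # k separate SpMVs would need.
        g_lo, g_hi = max(lo - k, 0), min(hi + k, len(x))
        v = x[g_lo:g_hi].copy()
        out = []
        for _ in range(k):
            w = 2 * v
            w[:-1] -= v[1:]
            w[1:] -= v[:-1]
            v = w   # ghost-zone edges go stale one index per step, but
                    # the owned window lo:hi stays correct for all k steps
            out.append(v[lo - g_lo : hi - g_lo])
        return np.array(out)

    # Check against k explicit SpMVs
    n, k, p = 40, 3, 4
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x = np.random.default_rng(0).standard_normal(n)
    blocks = [local_matrix_powers(x, i*n//p, (i+1)*n//p, k) for i in range(p)]
    ref = [np.linalg.matrix_power(A, j) @ x for j in range(1, k + 1)]
    assert np.allclose(np.hstack(blocks), np.array(ref))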

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: reference implementation vs. best register blocking (4x2), rates in Mflops.]

79

Register Profile: Itanium 2

[Figure: Mflops for all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four machines – Power3 (17% of peak, 122–252 Mflops), Power4 (16%, 459–820 Mflops), Itanium 1 (8%, 107–247 Mflops), Itanium 2 (33%, 190 Mflops–1.2 Gflops).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
(see the register-blocking sketch below)

85
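A sketch of register blocking with explicit zero fill, using SciPy's BSR format on an assumed random test matrix (illustrative only; tuned implementations unroll the dense r×c block multiplies in C).

    import numpy as np
    from scipy.sparse import random as sparse_random, bsr_matrix

    rng = np.random.default_rng(0)
    A = sparse_random(300, 300, density=0.02, random_state=rng, format="csr")

    # Store A as 3x3 blocks, filling explicit zeros where a block is only
    # partially populated; the inner loop becomes an unrolled 3x3 multiply.
    A_bsr = bsr_matrix(A, blocksize=(3, 3))
    fill_ratio = A_bsr.data.size / A.nnz   # stored values (incl. zeros) / nnz
    print("fill ratio:", round(fill_ratio, 2))

    x = rng.standard_normal(300)
    assert np.allclose(A @ x, A_bsr @ x)   # extra zero flops, same answer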

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red
After: Green + Blue
2x speedups on Pentium 4, Power 4, … (see the RCM sketch below)

89
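The RCM half of the reordering can be sketched with SciPy on an assumed random symmetric pattern (the TSP-based ordering that creates dense blocks is not shown).

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    rng = np.random.default_rng(0)
    A = sparse_random(200, 200, density=0.02, random_state=rng, format="csr")
    A = (A + A.T).tocsr()                  # symmetrize the pattern for RCM

    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    A_rcm = A[perm][:, perm]               # symmetric permutation P A P^T

    def bandwidth(M):
        coo = M.tocoo()
        return int(np.abs(coo.row - coo.col).max())

    # RCM pulls nonzeros toward the diagonal, enabling denser structure.
    print(bandwidth(A), "->", bandwidth(A_rcm))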

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR (see the sketch below)
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

                                                                                                                                                90
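Why multiple vectors help, in a few lines: SpMM streams the matrix once and reuses each entry across all k vectors (a NumPy/SciPy sketch, not the tuned kernel).

    import numpy as np
    from scipy.sparse import random as sparse_random

    rng = np.random.default_rng(0)
    A = sparse_random(2000, 2000, density=0.01, random_state=rng, format="csr")
    X = rng.standard_normal((2000, 8))

    # k separate SpMVs stream A's entries from memory k times...
    Y_spmv = np.column_stack([A @ X[:, j] for j in range(X.shape[1])])
    # ...SpMM streams A once and reuses each entry across all k vectors,
    # which is where the large speedup over repeated CSR SpMV comes from.
    Y_spmm = A @ X
    assert np.allclose(Y_spmv, Y_spmm)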

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                                                91

                                                                                                                                                Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                93

                                                                                                                                                Example Classical Conjugate Gradient (CG)

                                                                                                                                                SpMVs and dot products require communication in

                                                                                                                                                each iteration

                                                                                                                                                via CA Matrix Powers Kernel

                                                                                                                                                Global reduction to compute G

                                                                                                                                                94

                                                                                                                                                Example CA-Conjugate Gradient

                                                                                                                                                Local computations within inner loop require

                                                                                                                                                no communication

                                                                                                                                                Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                96

                                                                                                                                                Slower convergence due

                                                                                                                                                to roundoff

                                                                                                                                                Loss of accuracy due to roundoff

                                                                                                                                                At s = 16 monomial basis is rank deficient Method breaks down

                                                                                                                                                Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

                                                                                                                                                CA-CG (monomial)CG

                                                                                                                                                machine precision

                                                                                                                                                97

                                                                                                                                                Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

                                                                                                                                                bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

                                                                                                                                                bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

                                                                                                                                                matrices

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries
  Explicit (O(nnz))            CSR and variations          Vision, climate, AMR, …
  Implicit (o(nnz))            Graph Laplacian             Stencils
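To make the last bullet concrete, here is the k = 2 case worked out (straightforward algebra, not taken from the slides): every term containing U or V collects into a single low-rank factor, so squaring preserves the sparse-plus-low-rank form.

\[
A^2 = (S + UDV^T)^2
    = S^2 + \begin{bmatrix} SU & U \end{bmatrix}
      \begin{bmatrix} D & 0 \\ DV^T U D & D \end{bmatrix}
      \begin{bmatrix} V^T \\ V^T S \end{bmatrix},
\]

so \(A^2 = S^2 + U_2 D_2 V_2^T\), with the rank of the correction at most twice the rank of D; iterating gives the required representation of \(A^k\).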

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value
[Plot: Absolute Error for Random Vectors; results of same magnitude, opposite signs]
[Plot: Relative Error for Orthogonal Vectors; even the sign is not reproducible]

                                                                                                                                                103
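None of this requires MKL to observe; it is ordinary floating-point nonassociativity. A tiny serial sketch (illustrative only, simulating a 2-thread reduction by splitting the sum in half):

```python
import random

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(10**6)]

# One "thread": plain left-to-right summation.
serial = sum(x)

# Two "threads": each sums half the data, then the partials are combined.
half = len(x) // 2
two_way = sum(x[:half]) + sum(x[half:])

# Floating-point addition is not associative, so the two orderings
# typically differ in the last bits.
print(serial - two_way)  # usually nonzero, on the order of 1e-13
```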

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get the exact answer (fails 2)
  – Prerounding technique (Nguyen, D.)

                                                                                                                                                104
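A minimal sketch of the prerounding idea, in plain Python (the helper `prerounded_sum` is hypothetical and illustrative only; the production algorithm of Nguyen and Demmel uses several bins and careful error bounds, none of which is shown here): snap every summand to a common quantum so that each partial sum is exactly representable, which makes the result bit-identical for any summation order.

```python
import math

def prerounded_sum(x):
    """Order-independent summation sketch via prerounding (illustrative)."""
    n = len(x)
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    # math.frexp(m)[1] is the exponent e with m = f * 2**e, 0.5 <= f < 1.
    # Choose the quantum so that n summands of magnitude <= 2**e amount to
    # fewer than 2**53 quanta: then every summand and every partial sum is
    # an exactly representable integer multiple of the quantum.
    quantum = math.ldexp(1.0, math.frexp(m)[1] - 53 + n.bit_length())
    # Rounding to the quantum is where accuracy is traded for reproducibility.
    return sum(round(v / quantum) * quantum for v in x)

vals = [0.1 * i for i in range(1000)]
fwd = prerounded_sum(vals)
bwd = prerounded_sum(list(reversed(vals)))
assert fwd == bwd  # bit-identical regardless of order
print(fwd)
```

The single quantum caps the attainable accuracy at roughly n times the quantum; the real technique recovers accuracy by keeping a few bins of decreasing magnitude while preserving the same order-independence.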

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code, for n = 1M.

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Don't Communic…

106




                                                                                                                                                  106

                                                                                                                                                  Time to redesign all linear algebra n-body hellip algorithms and software

                                                                                                                                                  (and compilers)

                                                                                                                                                  • Implementing Communication-Avoiding Algorithms
                                                                                                                                                  • Why avoid communication
                                                                                                                                                  • Goals
                                                                                                                                                  • Outline
                                                                                                                                                  • Outline (2)
                                                                                                                                                  • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                  • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                  • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                  • Limits to parallel scaling (12)
                                                                                                                                                  • Limits to parallel scaling (22)
                                                                                                                                                  • Can we attain these lower bounds
                                                                                                                                                  • Outline (3)
                                                                                                                                                  • 25D Matrix Multiplication
                                                                                                                                                  • 25D Matrix Multiplication (2)
                                                                                                                                                  • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                  • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                  • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                  • Handling Heterogeneity
                                                                                                                                                  • Application to Tensor Contractions
                                                                                                                                                  • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                  • Application to Tensor Contractions (2)
                                                                                                                                                  • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                  • vs
                                                                                                                                                  • Slide 26
                                                                                                                                                  • Strassen-like beyond matmul
                                                                                                                                                  • Cache and Network Oblivious Algorithms
                                                                                                                                                  • CARMA Performance Distributed Memory
                                                                                                                                                  • CARMA Performance Distributed Memory (2)
                                                                                                                                                  • CARMA Performance Shared Memory
                                                                                                                                                  • CARMA Performance Shared Memory (2)
                                                                                                                                                  • Why is CARMA Faster in Shared Memory
                                                                                                                                                  • Outline (4)
                                                                                                                                                  • One-sided Factorizations (LU QR) so far
                                                                                                                                                  • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                  • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                  • Minimizing Communication in TSLU
                                                                                                                                                  • Making TSLU Numerically Stable
                                                                                                                                                  • Stability of LU using TSLU CALU
                                                                                                                                                  • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                  • Fixing TSLU
                                                                                                                                                  • 2D CALU with Tournament Pivoting
                                                                                                                                                  • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                  • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                  • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                  • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                  • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                  • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                  • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                  • Outline (5)
                                                                                                                                                  • What about sparse matrices (13)
                                                                                                                                                  • Performance of 25D APSP using Kleene
                                                                                                                                                  • What about sparse matrices (23)
                                                                                                                                                  • What about sparse matrices (33)
                                                                                                                                                  • Outline (6)
                                                                                                                                                  • Symmetric Eigenproblem and SVD
                                                                                                                                                  • Slide 58
                                                                                                                                                  • Slide 59
                                                                                                                                                  • Slide 60
                                                                                                                                                  • Slide 61
                                                                                                                                                  • Slide 62
                                                                                                                                                  • Slide 63
                                                                                                                                                  • Slide 64
                                                                                                                                                  • Slide 65
                                                                                                                                                  • Slide 66
                                                                                                                                                  • Slide 67
                                                                                                                                                  • Slide 68
                                                                                                                                                  • Conventional vs CA - SBR
                                                                                                                                                  • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                  • Nonsymmetric Eigenproblem
                                                                                                                                                  • Attaining the Lower bounds Sequential
                                                                                                                                                  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                  • Outline (7)
                                                                                                                                                  • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                  • Outline (8)
                                                                                                                                                  • Example The Difficulty of Tuning SpMV
                                                                                                                                                  • Example The Difficulty of Tuning
                                                                                                                                                  • Speedups on Itanium 2 The Need for Search
                                                                                                                                                  • Register Profile Itanium 2
                                                                                                                                                  • Register Profiles IBM and Intel IA-64
                                                                                                                                                  • Another example of tuning challenges for SpMV
                                                                                                                                                  • Zoom in to top corner
                                                                                                                                                  • 3x3 blocks look natural buthellip
                                                                                                                                                  • Extra Work Can Improve Efficiency
                                                                                                                                                  • Slide 86
                                                                                                                                                  • Slide 87
                                                                                                                                                  • Slide 88
                                                                                                                                                  • Slide 89
                                                                                                                                                  • Summary of Other Performance Optimizations
                                                                                                                                                  • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                  • Outline (9)
                                                                                                                                                  • Example Classical Conjugate Gradient (CG)
                                                                                                                                                  • Example CA-Conjugate Gradient
                                                                                                                                                  • Outline (10)
                                                                                                                                                  • Slide 96
                                                                                                                                                  • Slide 97
                                                                                                                                                  • Outline (11)
                                                                                                                                                  • What is a ldquosparse matrixrdquo
                                                                                                                                                  • Outline (12)
                                                                                                                                                  • Reproducible Floating Point Computation
                                                                                                                                                  • Intel MKL non-reproducibility
                                                                                                                                                  • GoalsApproaches for Reproducibility
                                                                                                                                                  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                  • Collaborators and Supporters
                                                                                                                                                  • Summary

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: Mflop/s achieved by every register block size; the unblocked Reference point and the Best block size (4x2) are marked.]

79

Register Profile: Itanium 2

[Figure: register-blocking performance profile; performance ranges from 190 to 1190 Mflop/s.]

80

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms, showing best fraction of machine peak and the Mflop/s range across block sizes. Power3: 17% (122–252 Mflop/s); Power4: 16% (459–820 Mflop/s); Itanium 1: 8% (107–247 Mflop/s); Itanium 2: 33% (190 Mflop/s to 1.2 Gflop/s).]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies (sketched below)
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher

85
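To make the register-blocking idea concrete, here is a minimal sketch of a 3x3 block-CSR (BCSR) SpMV; the array names (brow_ptr, bcol_idx, bval) are illustrative, not any library's API. Because explicit zero fill makes every stored block dense, the 3x3 block multiply can be fully unrolled, and one column index is amortized over nine stored values instead of one per nonzero as in CSR:

```python
import numpy as np

def spmv_bcsr_3x3(brow_ptr, bcol_idx, bval, x):
    # bval[k] is the k-th stored dense 3x3 block (explicit zeros included);
    # bcol_idx[k] is its block column; brow_ptr brackets each block row.
    n_brows = len(brow_ptr) - 1
    y = np.zeros(3 * n_brows)
    for I in range(n_brows):
        acc = np.zeros(3)                     # block row of y stays "in registers"
        for k in range(brow_ptr[I], brow_ptr[I + 1]):
            j = 3 * bcol_idx[k]
            acc += bval[k] @ x[j:j + 3]       # the unrolled 3x3 block multiply
        y[3 * I : 3 * I + 3] = acc
    return y
```

The fill ratio is simply (stored values) / (true nonzeros); blocking wins whenever the saved index traffic and better register reuse outweigh the extra flops spent on the filled-in zeros.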

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering
• Before: Green + Red; After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning (a toy block-size search is sketched below)
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
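The "search" part of autotuning is easy to demonstrate. The sketch below (my toy example, not OSKI's actual API) times SpMV for a few block sizes using SciPy's BSR format and keeps the fastest. On a random matrix with no natural block substructure, fill-in usually makes blocking lose; on matrices like raefsky it wins. That the answer depends on the matrix and the machine is exactly why the search is worth automating:

```python
import time
import numpy as np
import scipy.sparse as sp

n = 4092                                        # divisible by 2, 3, and 4
A = sp.random(n, n, density=2e-3, format="csr", random_state=0)
x = np.ones(n)

results = {}
for r in (1, 2, 3, 4):
    B = sp.bsr_matrix(A, blocksize=(r, r))      # explicit zero fill happens here
    fill = B.nnz / A.nnz                        # BSR's nnz counts stored values, incl. fill
    t0 = time.perf_counter()
    for _ in range(100):
        _ = B @ x
    results[(r, r)] = (time.perf_counter() - t0, fill)

for bs, (t, fill) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(bs, f"time={t:.4f}s  fill_ratio={fill:.2f}")
```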

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

94
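For reference, here is classical CG in plain NumPy (a sketch: A_mult stands for whatever applies the sparse matrix, and the names are mine), with the communication-bound operations marked. In a distributed run, each dot product is a global reduction and each SpMV exchanges halo data with neighbors, so classical CG pays that latency every iteration:

```python
import numpy as np

def cg(A_mult, b, x0, n_iters):
    # Classical CG: per iteration, 1 SpMV and 2 dot products.
    x = x0.copy()
    r = b - A_mult(x)
    p = r.copy()
    rr = r @ r                       # global reduction
    for _ in range(n_iters):
        Ap = A_mult(p)               # SpMV: neighbor (halo) communication
        alpha = rr / (p @ Ap)        # global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r               # global reduction
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```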

Example: CA-Conjugate Gradient

The s SpMVs are hoisted out of the loop via the CA matrix-powers kernel, and a single global reduction computes the Gram matrix G; the local computations within the inner loop then require no communication.
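A minimal sequential sketch of the two CA ingredients (illustrative names; not a complete CA-CG implementation): the matrix-powers kernel produces all s+1 basis vectors with one round of (redundant, ghost-zone) neighbor communication instead of s rounds, and a single Gram-matrix product replaces the 2s separate dot-product reductions of classical CG:

```python
import numpy as np

def matrix_powers(A_mult, p, s):
    # Monomial Krylov basis V = [p, A p, ..., A^s p].
    # In parallel, a CA matrix-powers kernel computes all of these with ONE
    # round of neighbor communication (plus redundant flops on ghost zones).
    V = [p]
    for _ in range(s):
        V.append(A_mult(V[-1]))
    return np.column_stack(V)

# One global reduction then supplies every inner product needed for the
# next s iterations: the small (s+1) x (s+1) Gram matrix
#     G = V.T @ V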

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG converges more slowly and loses accuracy (relative to machine precision) due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
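The breakdown is easy to reproduce: the columns of the monomial basis all tilt toward the dominant eigenvector, so the basis condition number grows rapidly with s. A quick check on my reconstruction of the slide's model problem (exact values will vary with the starting vector):

```python
import numpy as np

# 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400), as on the slide.
n = 30
T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

rng = np.random.default_rng(1)
v = rng.standard_normal(n*n)
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))          # normalized monomial basis vectors
    sv = np.linalg.svd(np.column_stack(V), compute_uv=False)
    print(f"s={s:2d}  cond(V)={sv[0]/sv[-1]:.2e}")
# cond(V) grows quickly with s; once it approaches 1/eps the basis is
# numerically rank deficient and monomial CA-CG breaks down.
```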

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices

                                 Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz)):  CSR and variations      Vision, climate, AMR, …
  Nonzero entries implicit (o(nnz)):  Graph Laplacian         Stencils
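For the "implicit entries, implicit indices" corner of the table, a stencil is the canonical example: the operator is applied matrix-free, storing neither values nor indices. A minimal sketch for the 2D 5-point Laplacian (my conventions, Dirichlet boundaries):

```python
import numpy as np

def apply_laplacian_2d(u):
    # 5-point stencil applied matrix-free: o(nnz) storage -- in fact O(1),
    # since neither the nonzero values nor their indices are stored.
    v = 4.0 * u
    v[1:, :]  -= u[:-1, :]
    v[:-1, :] -= u[1:, :]
    v[:, 1:]  -= u[:, :-1]
    v[:, :-1] -= u[:, 1:]
    return v

u = np.random.default_rng(0).standard_normal((30, 30))
v = apply_laplacian_2d(u)    # the same operator as the 30x30 model problem
```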

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: "Absolute Error for Random Vectors" and "Relative Error for Orthogonal Vectors" for MKL dot products. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. For orthogonal vectors the answers can have the same magnitude but opposite signs: even the sign is not reproducible.]

103
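The effect is just floating-point nonassociativity. A toy stand-in for MKL's threading (my simulation, not MKL code): split the vector into t chunks, sum each chunk, combine the partial sums, and compare across t. The spread is typically nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)

def sum_with_t_threads(x, t):
    # Stand-in for a t-thread reduction: each "thread" sums one chunk, then
    # the partial sums are combined. Different t means a different
    # association of the same additions, hence (usually) a different answer.
    return sum(float(np.sum(c)) for c in np.array_split(x, t))

sums = [sum_with_t_threads(x, t) for t in (1, 2, 3, 4)]
print("absolute error:", max(sums) - min(sums))   # typically ~1e-12, not 0
```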

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.) - sketched below

104
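Here is a deliberately simplified, one-bin sketch of the prerounding idea (the actual Nguyen/Demmel algorithm uses several bins and is far more accurate; the function name and structure here are mine). Each summand is rounded to a common absolute granularity chosen so that adding the rounded values is exact, making the result independent of summation order, thread count, and reduction-tree shape:

```python
import numpy as np

def prerounded_sum(x):
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    if n == 0:
        return 0.0
    m = float(np.max(np.abs(x)))     # one global max-reduction (itself reproducible)
    if m == 0.0:
        return 0.0
    # Pick a power of two S >= 2 * n * max|x_i|. Then (x + S) - S rounds each
    # x_i to a multiple of ulp(S)/2, and every partial sum of the rounded
    # values is exactly representable, so the final sum is exact -- hence
    # order-independent. Accuracy degrades with n * max|x_i| (the price of
    # using only one bin).
    S = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 1.0)
    q = (x + S) - S                  # prerounding (defeated by fast-math reassociation)
    return float(np.sum(q))          # one global sum-reduction, now exact

x = np.random.default_rng(0).standard_normal(10**6)
print(prerounded_sum(x) == prerounded_sum(x[::-1]))   # True: order-independent
```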

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                    • Implementing Communication-Avoiding Algorithms
                                                                                                                                                    • Why avoid communication
                                                                                                                                                    • Goals
                                                                                                                                                    • Outline
                                                                                                                                                    • Outline (2)
                                                                                                                                                    • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                    • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                    • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                    • Limits to parallel scaling (12)
                                                                                                                                                    • Limits to parallel scaling (22)
                                                                                                                                                    • Can we attain these lower bounds
                                                                                                                                                    • Outline (3)
• 2.5D Matrix Multiplication
• 2.5D Matrix Multiplication (2)
• 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
• Perfect Strong Scaling – in Time and Energy (1/2)
• Perfect Strong Scaling – in Time and Energy (2/2)
                                                                                                                                                    • Handling Heterogeneity
                                                                                                                                                    • Application to Tensor Contractions
• C(i,j,k) = Σm A(i,j,m)·B(m,k)
                                                                                                                                                    • Application to Tensor Contractions (2)
                                                                                                                                                    • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                    • vs
                                                                                                                                                    • Slide 26
                                                                                                                                                    • Strassen-like beyond matmul
                                                                                                                                                    • Cache and Network Oblivious Algorithms
                                                                                                                                                    • CARMA Performance Distributed Memory
                                                                                                                                                    • CARMA Performance Distributed Memory (2)
                                                                                                                                                    • CARMA Performance Shared Memory
                                                                                                                                                    • CARMA Performance Shared Memory (2)
                                                                                                                                                    • Why is CARMA Faster in Shared Memory
                                                                                                                                                    • Outline (4)
                                                                                                                                                    • One-sided Factorizations (LU QR) so far
                                                                                                                                                    • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                    • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                    • Minimizing Communication in TSLU
                                                                                                                                                    • Making TSLU Numerically Stable
                                                                                                                                                    • Stability of LU using TSLU CALU
• Why is stability of TSLU just a "Thm"?
                                                                                                                                                    • Fixing TSLU
                                                                                                                                                    • 2D CALU with Tournament Pivoting
• 2.5D CALU with Tournament Pivoting (c=4 copies)
• Exascale Machine Parameters (Source: DOE Exascale Workshop)
• Exascale predicted speedups for Gaussian Elimination: 2D CA
• 2.5D vs 2D LU: With and Without Pivoting
• Other CA algorithms for Ax=b, least squares (1/3)
• Other CA algorithms for Ax=b, least squares (2/3)
• Other CA algorithms for Ax=b, least squares (3/3)
                                                                                                                                                    • Outline (5)
• What about sparse matrices (1/3)
• Performance of 2.5D APSP using Kleene
• What about sparse matrices (2/3)
• What about sparse matrices (3/3)
                                                                                                                                                    • Outline (6)
                                                                                                                                                    • Symmetric Eigenproblem and SVD
                                                                                                                                                    • Slide 58
                                                                                                                                                    • Slide 59
                                                                                                                                                    • Slide 60
                                                                                                                                                    • Slide 61
                                                                                                                                                    • Slide 62
                                                                                                                                                    • Slide 63
                                                                                                                                                    • Slide 64
                                                                                                                                                    • Slide 65
                                                                                                                                                    • Slide 66
                                                                                                                                                    • Slide 67
                                                                                                                                                    • Slide 68
                                                                                                                                                    • Conventional vs CA - SBR
                                                                                                                                                    • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                    • Nonsymmetric Eigenproblem
                                                                                                                                                    • Attaining the Lower bounds Sequential
• Attaining the Lower bounds: Parallel 2D, M=(n²/P) (Ignoring po
                                                                                                                                                    • Outline (7)
                                                                                                                                                    • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                    • Outline (8)
                                                                                                                                                    • Example The Difficulty of Tuning SpMV
                                                                                                                                                    • Example The Difficulty of Tuning
                                                                                                                                                    • Speedups on Itanium 2 The Need for Search
                                                                                                                                                    • Register Profile Itanium 2
                                                                                                                                                    • Register Profiles IBM and Intel IA-64
                                                                                                                                                    • Another example of tuning challenges for SpMV
                                                                                                                                                    • Zoom in to top corner
                                                                                                                                                    • 3x3 blocks look natural buthellip
                                                                                                                                                    • Extra Work Can Improve Efficiency
                                                                                                                                                    • Slide 86
                                                                                                                                                    • Slide 87
                                                                                                                                                    • Slide 88
                                                                                                                                                    • Slide 89
                                                                                                                                                    • Summary of Other Performance Optimizations
                                                                                                                                                    • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                    • Outline (9)
                                                                                                                                                    • Example Classical Conjugate Gradient (CG)
                                                                                                                                                    • Example CA-Conjugate Gradient
                                                                                                                                                    • Outline (10)
                                                                                                                                                    • Slide 96
                                                                                                                                                    • Slide 97
                                                                                                                                                    • Outline (11)
• What is a "sparse matrix"?
                                                                                                                                                    • Outline (12)
                                                                                                                                                    • Reproducible Floating Point Computation
                                                                                                                                                    • Intel MKL non-reproducibility
                                                                                                                                                    • GoalsApproaches for Reproducibility
• Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown
                                                                                                                                                    • Collaborators and Supporters
                                                                                                                                                    • Summary

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

                                                                                                                                                      77
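To ground the discussion, here is the untuned reference point: a baseline CSR SpMV kernel (a minimal C sketch of the standard data structure, not code from the slides). Every nonzero costs an index load, a value load, and an indexed load of x, so the kernel is memory-bound and sensitive to the matrix's structure:

    /* Baseline SpMV, y = A*x, with A in compressed sparse row (CSR) format:
       rowptr has n+1 entries; colind and val have one entry per nonzero. */
    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = 0.0;
            for (int k = rowptr[i]; k < rowptr[i+1]; k++)
                yi += val[k] * x[colind[k]];  /* index load + value load + indexed load of x */
            y[i] = yi;
        }
    }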

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs (see the blocked-kernel sketch below)

                                                                                                                                                      78
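Register blocking exploits such dense substructure by storing the matrix in r×c blocks (BCSR): one column index is amortized over r·c values, and a small block of y stays in registers. A minimal sketch for 2×2 blocks (illustrative only; the raefsky matrix above would use 8×8, and the dimension is assumed divisible by the block size):

    /* SpMV with 2x2 register blocking (BCSR): one column index per block.
       brows = n/2 block rows; browptr/bcolind index blocks; bval stores each
       block's 4 entries contiguously, row-major. */
    void spmv_bcsr_2x2(int brows, const int *browptr, const int *bcolind,
                       const double *bval, const double *x, double *y)
    {
        for (int ib = 0; ib < brows; ib++) {
            double y0 = 0.0, y1 = 0.0;              /* block of y kept in registers */
            for (int k = browptr[ib]; k < browptr[ib+1]; k++) {
                const double *b = &bval[4*k];
                double x0 = x[2*bcolind[k]], x1 = x[2*bcolind[k] + 1];
                y0 += b[0]*x0 + b[1]*x1;            /* unrolled 2x2 block multiply */
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*ib] = y0; y[2*ib + 1] = y1;
        }
    }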

Speedups on Itanium 2: The Need for Search

[Register-blocking profile (Mflops) over all r×c block sizes: the reference unblocked code vs. the best found, 4×2. The winning block size is hard to predict without search.]

79

Register Profile: Itanium 2

[Heat map of SpMV performance over register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64

[Four register-profile heat maps; best performance as a fraction of machine peak, and the range over block sizes:
• Power3 (17% of peak): 122 to 252 Mflops
• Power4 (16% of peak): 459 to 820 Mflops
• Itanium 1 (8% of peak): 107 to 247 Mflops
• Itanium 2 (33% of peak): 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                                                                                                                                                      82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

                                                                                                                                                      83

3x3 blocks look natural, but…

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

                                                                                                                                                      84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking (see the fill-ratio sketch below)
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
 – Actual Mflop rate 1.5² = 2.25x higher, since the blocked code runs 1.5x faster while also doing 1.5x as many flops (counting the explicit zeros)

                                                                                                                                                      85
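Whether blocking pays off is a simple tradeoff: it wins when the Mflop rate of the r×c kernel exceeds the fill ratio times the CSR rate. A small C sketch (my own illustration of the idea, not OSKI's actual heuristic code) that estimates the fill ratio of a CSR matrix for a candidate block size:

    /* Estimate the BCSR fill ratio for r x c blocks: (entries stored after
       padding each touched block with explicit zeros) / (true nonzeros).
       The O(n^2/(r*c)) clearing loop is fine for an illustration. */
    #include <stdlib.h>

    double fill_ratio(int n, int nnz, const int *rowptr, const int *colind,
                      int r, int c)
    {
        int nbcols = (n + c - 1) / c;
        char *touched = calloc(nbcols, 1);   /* block columns hit in this block row */
        if (!touched) return -1.0;
        long blocks = 0;
        for (int ib = 0; ib < n; ib += r) {
            int iend = ib + r < n ? ib + r : n;
            for (int i = ib; i < iend; i++)
                for (int k = rowptr[i]; k < rowptr[i+1]; k++)
                    touched[colind[k] / c] = 1;
            for (int jb = 0; jb < nbcols; jb++)
                if (touched[jb]) { blocks++; touched[jb] = 0; }
        }
        free(touched);
        return (double)blocks * r * c / nnz;
    }

For the example above, a ratio of 1.5 at 3×3 is worth it because the 3×3 kernel runs more than 1.5x faster than CSR.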

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

• Before: green + red
• After: green + blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR (see sketch below)
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

                                                                                                                                                      90
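The multiple-vector speedup comes from reuse: each matrix entry and column index is read once and applied to k vectors. A minimal C sketch (my illustration; X and Y are n×k, row-major):

    /* SpMM: one sparse matrix times k dense vectors at once. */
    void spmm_csr(int n, int k, const int *rowptr, const int *colind,
                  const double *val, const double *X, double *Y)
    {
        for (int i = 0; i < n; i++) {
            for (int c = 0; c < k; c++) Y[i*k + c] = 0.0;
            for (int p = rowptr[i]; p < rowptr[i+1]; p++) {
                double a = val[p];
                const double *xrow = &X[colind[p]*k];
                for (int c = 0; c < k; c++)
                    Y[i*k + c] += a * xrow[c];   /* a and colind[p] reused k times */
            }
        }
    }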

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski
• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

                                                                                                                                                      91
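A typical OSKI calling sequence, sketched from the library's documented interface (treat the exact signatures and constants as assumptions to be checked against the distribution's headers): wrap existing CSR arrays in a handle, declare the expected workload, let the library tune, then call the tuned kernel.

    /* Sketch of OSKI usage for y = A*x; based on the documented interface,
       not verified against a particular OSKI release. */
    #include <oski/oski.h>

    void tuned_spmv(int n, int *Aptr, int *Aind, double *Aval,
                    double *x, double *y, int num_calls)
    {
        oski_Init();
        oski_matrix_t A = oski_CreateMatCSR(Aptr, Aind, Aval, n, n,
                                            SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
        oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
        oski_vecview_t yv = oski_CreateVecView(y, n, STRIDE_UNIT);

        /* Hint: this matrix will be multiplied many times, so tuning pays. */
        oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, num_calls);
        oski_TuneMat(A);                 /* may convert A to a blocked format */

        for (int i = 0; i < num_calls; i++)
            oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);  /* y = A*x */

        oski_DestroyVecView(xv); oski_DestroyVecView(yv);
        oski_DestroyMat(A);
        oski_Close();
    }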

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                                                      93

Example: Classical Conjugate Gradient (CG)

[Pseudocode figure. Callouts: the SpMVs and dot products require communication in each iteration; in CA-CG they become one call to the CA matrix powers kernel plus a single global reduction to compute the Gram matrix G.]

94
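For reference, classical CG in C (a standard textbook sketch, with the matrix-vector product and dot product abstracted; comments mark where parallel communication would occur):

    /* Classical CG for SPD A. matvec() (an SpMV: neighbor communication) and
       dot() (a global reduction) are supplied by the application; both
       communicate in every iteration. */
    #include <math.h>

    void matvec(const double *x, double *y);             /* y = A*x */
    double dot(const double *u, const double *v, int n);

    void cg(int n, const double *b, double *x,
            double *r, double *p, double *Ap,            /* work vectors */
            int maxit, double tol)
    {
        for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
        double rr = dot(r, r, n);                        /* reduction */
        for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
            matvec(p, Ap);                               /* SpMV */
            double alpha = rr / dot(p, Ap, n);           /* reduction */
            for (int i = 0; i < n; i++) { x[i] += alpha*p[i]; r[i] -= alpha*Ap[i]; }
            double rr_new = dot(r, r, n);                /* reduction */
            for (int i = 0; i < n; i++) p[i] = r[i] + (rr_new/rr)*p[i];
            rr = rr_new;
        }
    }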

Example: CA-Conjugate Gradient

[Pseudocode figure: each outer iteration takes s steps of CG at once. Local computations within the inner loop require no communication.]
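The enabling kernel is the matrix powers computation of the Krylov basis [p, Ap, A²p, …, Aˢp]. The naive version below still communicates once per step; the CA version produces the same vectors with a single round of communication by first exchanging s layers of ghost data (that bookkeeping is omitted here). matvec() is the routine from the CG sketch above.

    /* Monomial Krylov basis: V[j] = A^j * p for j = 0..s; V is (s+1) x n,
       row-major. In the CA version, after the up-front ghost exchange this
       whole loop runs locally with no further communication. */
    void matvec(const double *x, double *y);    /* y = A*x */

    void matrix_powers(int n, int s, const double *p, double *V)
    {
        for (int i = 0; i < n; i++) V[i] = p[i];   /* V[0] = p */
        for (int j = 1; j <= s; j++)
            matvec(V + (j-1)*n, V + j*n);          /* V[j] = A*V[j-1] */
    }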

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                                                      96

[Convergence plot: CA-CG (monomial basis) vs. classical CG. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down, while CG converges to machine precision.]

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices

Taxonomy:

                             Indices explicit (O(nnz)) | Indices implicit (o(nnz))
 Entries explicit (O(nnz)):  CSR and variations        | Vision, climate, AMR, …
 Entries implicit (o(nnz)):  Graph Laplacian           | Stencils
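The "both implicit" corner is the cheapest to store: for a stencil, neither indices nor values are stored at all. A minimal sketch for the 2D Poisson 5-point stencil used in the CA-CG experiment above (Dirichlet boundaries, grid stored row-major):

    /* y = A*x for the 2D Poisson 5-point stencil on an N x N grid: A is
       never stored; entries (4 on the diagonal, -1 for neighbors) and
       indices are implicit in the loop structure. */
    void poisson5_matvec(int N, const double *x, double *y)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double v = 4.0 * x[i*N + j];
                if (i > 0)   v -= x[(i-1)*N + j];
                if (i < N-1) v -= x[(i+1)*N + j];
                if (j > 0)   v -= x[i*N + j - 1];
                if (j < N-1) v -= x[i*N + j + 1];
                y[i*N + j] = v;
            }
    }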

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

                                                                                                                                                      101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm at GNS-MBH
 – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
 – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
 – Most: "What? How will I debug without reproducibility?"
 – Few: "I know better, and do careful error analysis"
 – S. Govindjee: needs it for fracture simulations
 – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Two panels: absolute error for random vectors (results of the same magnitude but opposite signs — even the sign is not reproducible) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value]

                                                                                                                                                      103
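The root cause is that floating-point addition is not associative, so different thread counts (hence different reduction orders) give different sums. A two-line illustration (standard IEEE-754 double behavior, not MKL-specific):

    /* Floating-point addition is not associative, so reduction order
       (e.g., number of threads) changes the bits of the result. */
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0, y = 1e-16;
        printf("%.17g\n", (x + y) + y);   /* 1: each tiny term rounds away */
        printf("%.17g\n", x + (y + y));   /* 1.0000000000000002 */
        return 0;
    }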

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee fixed reduction tree (not 2 or 3)
 – Use (very) high precision to get exact answer (not 2)
 – Prerounding technique (Nguyen, D.)

                                                                                                                                                      104
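A minimal single-fold sketch of the prerounding idea (my illustration of the principle behind Nguyen & Demmel's approach, not their production algorithm): pre-round every summand to a common coarse grid so the high-order parts add exactly, making the sum independent of ordering; the discarded low-order parts can be folded again for more accuracy.

    /* One "fold" of prerounding summation (illustrative). Assumes IEEE 754
       doubles, round-to-nearest, no overflow, and n < 2^20. */
    #include <math.h>

    double repro_sum(const double *x, int n)
    {
        double m = 0.0;                           /* max |x_i|: exact, order-free */
        for (int i = 0; i < n; i++)
            if (fabs(x[i]) > m) m = fabs(x[i]);
        if (m == 0.0) return 0.0;

        double M = ldexp(1.0, ilogb(m) + 40);     /* boundary well above n*max */
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            double hi = (M + x[i]) - M;           /* pre-round x_i to a coarse grid */
            s += hi;                              /* every add is exact => order-free */
        }
        return s;   /* identical bits for any summation order */
    }

Because each hi is a multiple of a coarse unit and the partial sums never need more than 53 bits, the accumulation incurs no rounding, so any summation order (or reduction tree) yields the same bits; the production code achieves the slowdowns reported below with multiple bins and a single pass.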

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106

                                                                                                                                                      • Outline (7)
                                                                                                                                                      • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                      • Outline (8)
                                                                                                                                                      • Example The Difficulty of Tuning SpMV
                                                                                                                                                      • Example The Difficulty of Tuning
                                                                                                                                                      • Speedups on Itanium 2 The Need for Search
                                                                                                                                                      • Register Profile Itanium 2
                                                                                                                                                      • Register Profiles IBM and Intel IA-64
                                                                                                                                                      • Another example of tuning challenges for SpMV
                                                                                                                                                      • Zoom in to top corner
                                                                                                                                                      • 3x3 blocks look natural buthellip
                                                                                                                                                      • Extra Work Can Improve Efficiency
                                                                                                                                                      • Slide 86
                                                                                                                                                      • Slide 87
                                                                                                                                                      • Slide 88
                                                                                                                                                      • Slide 89
                                                                                                                                                      • Summary of Other Performance Optimizations
                                                                                                                                                      • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                      • Outline (9)
                                                                                                                                                      • Example Classical Conjugate Gradient (CG)
                                                                                                                                                      • Example CA-Conjugate Gradient
                                                                                                                                                      • Outline (10)
                                                                                                                                                      • Slide 96
                                                                                                                                                      • Slide 97
                                                                                                                                                      • Outline (11)
                                                                                                                                                      • What is a ldquosparse matrixrdquo
                                                                                                                                                      • Outline (12)
                                                                                                                                                      • Reproducible Floating Point Computation
                                                                                                                                                      • Intel MKL non-reproducibility
                                                                                                                                                      • GoalsApproaches for Reproducibility
                                                                                                                                                      • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                      • Collaborators and Supporters
                                                                                                                                                      • Summary

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

Speedups on Itanium 2: The Need for Search

[Figure: SpMV performance (Mflops) over all register block sizes; the reference CSR code is far from the best block size, 4x2, which only an exhaustive search finds]

Register Profile: Itanium 2

[Figure: Mflops for every register block size on Itanium 2, from 190 Mflops (reference) up to 1190 Mflops (best)]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles on four platforms, reference vs. best performance (best as fraction of peak): Power3: 122 → 252 Mflops (17%); Power4: 459 → 820 Mflops (16%); Itanium 1: 107 → 247 Mflops (8%); Itanium 2: 190 Mflops → 1.2 Gflops (33%)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher
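To make the trick concrete, here is a minimal sketch of an unrolled 3x3 register-blocked SpMV kernel (the BSR-style layout and names are illustrative, not OSKI's internals; the explicit zeros from fill-in are simply stored inside the 3x3 blocks):

```c
/* y += A*x with A stored as 3x3 blocks (BSR-style):
 * brow_ptr/bcol_idx index blocks, val holds each 3x3 block
 * contiguously in row-major order, fill-in zeros included. */
void spmv_bsr_3x3(int n_brows, const int *brow_ptr, const int *bcol_idx,
                  const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < n_brows; ib++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;      /* stay in registers */
        for (int jb = brow_ptr[ib]; jb < brow_ptr[ib + 1]; jb++) {
            const double *a  = &val[9 * jb];
            const double *xp = &x[3 * bcol_idx[jb]];
            /* fully unrolled block multiply: one column index
             * amortized over 9 multiply-adds */
            y0 += a[0]*xp[0] + a[1]*xp[1] + a[2]*xp[2];
            y1 += a[3]*xp[0] + a[4]*xp[1] + a[5]*xp[2];
            y2 += a[6]*xp[0] + a[7]*xp[1] + a[8]*xp[2];
        }
        y[3*ib]     += y0;
        y[3*ib + 1] += y1;
        y[3*ib + 2] += y2;
    }
}
```

With a fill ratio of 1.5 this kernel does 1.5x the flops of CSR, so a 1.5x speedup in time means the flop rate rose by 1.5² = 2.25x.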

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                                                        100x100 Submatrix Along Diagonal


                                                                                                                                                        Post-RCM Reordering


                                                                                                                                                        Effect of Combined RCM+TSP Reordering

[Figure: nonzero pattern before (green + red) and after (green + blue) the combined reordering]

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
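For flavor, a typical call sequence looks roughly like the following (a sketch following the OSKI user guide's CSR example; the tiny matrix and the hint count of 500 calls are illustrative assumptions):

```c
/* y = A*x for a tiny 3x3 CSR matrix, through OSKI's tuned kernels. */
#include <stdio.h>
#include <oski/oski.h>

int main(void)
{
    int    ptr[] = {0, 2, 3, 5};        /* CSR row pointers   */
    int    ind[] = {0, 1, 1, 0, 2};     /* CSR column indices */
    double val[] = {1, 2, 3, 4, 5};     /* CSR values         */
    double x[]   = {1, 1, 1}, y[] = {0, 0, 0};

    oski_Init();
    oski_matrix_t  A  = oski_CreateMatCSR(ptr, ind, val, 3, 3,
                                          SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, 3, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, 3, STRIDE_UNIT);

    /* Hint the expected workload, then let OSKI pick a data structure */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, 500);
    oski_TuneMat(A);

    oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);  /* y = 1.0*A*x + 0.0*y */
    printf("y = %g %g %g\n", y[0], y[1], y[2]);

    oski_DestroyMat(A);
    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_Close();
    return 0;
}
```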

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

Example: CA-Conjugate Gradient

The SpMVs of s consecutive iterations are replaced by one call to the CA matrix powers kernel, and the dot products by one global reduction to compute the Gram matrix G; the local computations within the inner loop then require no communication.
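As a reference point, here is a skeleton of classical CG with its communication points marked (a sketch under the usual CSR storage and row-partitioned parallelization assumptions; the helper names are ours):

```c
#include <stdlib.h>
#include <math.h>

/* CSR sparse matrix-vector product y = A*x.  In a parallel run this
 * needs neighbor communication for remote entries of x -- one of the
 * two communication points per iteration. */
static void spmv(int n, const int *ptr, const int *ind,
                 const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            s += val[k] * x[ind[k]];
        y[i] = s;
    }
}

/* Dot product: the other communication point (a global all-reduce). */
static double dot(int n, const double *u, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

/* Classical CG (x0 = 0): one SpMV and two dot products every
 * iteration, so communication cannot be avoided within the loop.
 * CA-CG instead takes s steps per outer iteration via the matrix
 * powers kernel plus one block reduction. */
void cg(int n, const int *ptr, const int *ind, const double *val,
        const double *b, double *x, int maxit, double tol)
{
    double *r = malloc(n * sizeof *r);
    double *p = malloc(n * sizeof *p);
    double *w = malloc(n * sizeof *w);
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = p[i] = b[i]; }
    double rho = dot(n, r, r);                       /* communication */
    for (int it = 0; it < maxit && sqrt(rho) > tol; it++) {
        spmv(n, ptr, ind, val, p, w);                /* communication */
        double alpha = rho / dot(n, p, w);           /* communication */
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * w[i];
        }
        double rho_new = dot(n, r, r);               /* communication */
        double beta = rho_new / rho;
        rho = rho_new;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
    }
    free(r); free(p); free(w);
}
```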

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. A horizontal line marks machine precision.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                         Indices: explicit (O(nnz))   Indices: implicit (o(nnz))
  Entries: explicit      CSR and variations           Vision, climate, AMR, …
  Entries: implicit      Graph Laplacian              Stencils
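To see why the last bullet works, expand the power for k = 2 (a worked instance we add for illustration):

$$(S + UDV^T)^2 \;=\; \underbrace{S^2}_{\text{sparse}} \;+\; \underbrace{(SU)\,D\,V^T \;+\; U\,D\,(V^T S) \;+\; U\,(D V^T U D)\,V^T}_{\text{rank} \,\le\, 3\cdot \mathrm{rank}(U)}$$

Every correction term keeps the narrow factors U and Vᵀ on the outside, so Aᵏ remains "sparse + low rank" for every k.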

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign of the result is not reproducible]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
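The experiment is easy to reproduce; here is a sketch of such a harness (the MKL calls — cblas_ddot, mkl_set_num_threads, mkl_malloc — are real, while the test scaffolding around them is ours):

```c
/* Same dot product, different thread counts: compare results bit
 * for bit.  Link against MKL (e.g. -lmkl_rt). */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    const int n = 1000000;
    double *x = mkl_malloc(n * sizeof *x, 16);   /* 16-byte aligned */
    double *y = mkl_malloc(n * sizeof *y, 16);
    for (int i = 0; i < n; i++) {                /* random summands */
        x[i] = (double)rand() / RAND_MAX - 0.5;
        y[i] = (double)rand() / RAND_MAX - 0.5;
    }
    double dmin = 1e300, dmax = -1e300;
    for (int t = 1; t <= 4; t++) {
        mkl_set_num_threads(t);          /* changes the reduction tree */
        double d = cblas_ddot(n, x, 1, y, 1);
        printf("threads=%d  dot=%.17g\n", t, d);
        if (d < dmin) dmin = d;
        if (d > dmax) dmax = d;
    }
    printf("absolute error (max - min) = %g\n", dmax - dmin);
    mkl_free(x); mkl_free(y);
    return 0;
}
```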

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
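The core of the prerounding trick fits in a few lines (a single-bin sketch we add for illustration; the production algorithm uses several bins and careful overflow handling to recover accuracy):

```c
#include <math.h>

/* Round every summand to a common grid so the remaining additions
 * are exact, hence independent of summation order and #processors.
 * Assumes round-to-nearest IEEE 754 double arithmetic. */
double reproducible_sum(int n, const double *x)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)                 /* max magnitude */
        if (fabs(x[i]) > m) m = fabs(x[i]);
    if (m == 0.0) return 0.0;

    /* S = power of two with S >= 2*n*m: each prerounded value is a
     * multiple of ulp(S), and all partial sums stay below S, so
     * every addition below is exact. */
    double S = ldexp(1.0, (int)ceil(log2((double)n * m)) + 1);

    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + S;  /* prerounds x[i] to the grid */
        sum += t - S;                  /* exact, order-independent   */
    }
    return sum;
}
```

The price is accuracy, not reproducibility: the prerounding discards low-order bits of each summand, which is exactly what the multi-bin versions win back.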

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


Speedups on Itanium 2: The Need for Search

[Figure: SpMV performance profile over register block sizes, in Mflops, marking the reference implementation and the best block size found by search (4x2)]

79

Register Profile: Itanium 2

[Figure: SpMV performance for all register block sizes, ranging from 190 Mflops to 1190 Mflops]

80

Register Profiles: IBM and Intel IA-64

[Figure, four panels of register-blocking profiles: Power3 (122 to 252 Mflops, best about 17% of peak), Power4 (459 to 820 Mflops, 16%), Itanium 1 (107 to 247 Mflops, 8%), Itanium 2 (190 Mflops to 1.2 Gflops, 33%)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5^2 = 2.25x higher: the blocked code does 1.5x as many flops, in less time

85
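To make the register-blocking idea concrete, here is a minimal sketch in C of a 3x3 block-CSR (BCSR) SpMV with the block multiply fully unrolled (array names are illustrative, not from any particular library). Explicit zeros introduced by fill-in are stored and multiplied like any other entry, which is why a fill ratio of 1.5 can still yield a net speedup: each block reuses three entries of x held in registers and needs only one column index per nine values.

    /* Sketch: y += A*x for A stored in 3x3 BCSR; dimensions assumed
     * multiples of 3. brow_ptr[i]..brow_ptr[i+1] index the blocks of
     * block-row i; bcol_idx[b] is block b's block-column; bval stores
     * each 3x3 block contiguously (row-major), fill-in zeros included. */
    void spmv_bcsr_3x3(int n_brows, const int *brow_ptr, const int *bcol_idx,
                       const double *bval, const double *x, double *y)
    {
        for (int i = 0; i < n_brows; i++) {
            double y0 = 0.0, y1 = 0.0, y2 = 0.0;  /* 3 output rows in registers */
            for (int b = brow_ptr[i]; b < brow_ptr[i+1]; b++) {
                const double *v  = &bval[9*b];
                const double *xp = &x[3*bcol_idx[b]];
                /* unrolled 3x3 block multiply: 3 loads of x, reused 3 times */
                y0 += v[0]*xp[0] + v[1]*xp[1] + v[2]*xp[2];
                y1 += v[3]*xp[0] + v[4]*xp[1] + v[5]*xp[2];
                y2 += v[6]*xp[0] + v[7]*xp[1] + v[8]*xp[2];
            }
            y[3*i] += y0; y[3*i+1] += y1; y[3*i+2] += y2;
        }
    }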

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the matrix]

86

100x100 Submatrix Along Diagonal

[Figure: spy plot of a 100x100 submatrix along the diagonal]

87

Post-RCM Reordering

[Figure: spy plot of the same submatrix after reverse Cuthill-McKee (RCM) reordering]

88

Effect of Combined RCM+TSP Reordering

[Figure: before = green + red, after = green + blue]

2x speedups on Pentium 4, Power 4, …

89
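The RCM step in these slides is a cheap graph heuristic; a minimal sketch in C (simplified: the seed here is just a minimum-degree node, whereas production codes use a pseudo-peripheral one) shows the whole idea: breadth-first search visiting each node's neighbors in order of increasing degree, then reverse the visit order. Pulling nonzeros toward the diagonal this way creates the dense substructure that blocking can exploit.

    #include <stdlib.h>

    /* Sketch: reverse Cuthill-McKee for a symmetric matrix/graph in CSR
     * (ptr/ind). perm[] receives the new ordering of the n nodes. */
    void rcm_order(int n, const int *ptr, const int *ind, int *perm)
    {
        char *visited = calloc(n, 1);
        int head = 0, tail = 0, seed = 0;

        for (int v = 1; v < n; v++)                    /* min-degree seed */
            if (ptr[v+1] - ptr[v] < ptr[seed+1] - ptr[seed]) seed = v;
        visited[seed] = 1; perm[tail++] = seed;

        while (head < tail) {
            int u = perm[head++], first = tail;
            for (int k = ptr[u]; k < ptr[u+1]; k++) {  /* enqueue unvisited */
                int w = ind[k];
                if (!visited[w]) { visited[w] = 1; perm[tail++] = w; }
            }
            for (int i = first + 1; i < tail; i++) {   /* sort new nodes by degree */
                int v = perm[i], j = i;
                while (j > first &&
                       ptr[perm[j-1]+1] - ptr[perm[j-1]] > ptr[v+1] - ptr[v]) {
                    perm[j] = perm[j-1]; j--;
                }
                perm[j] = v;
            }
            if (head == tail)                          /* disconnected graph */
                for (int v = 0; v < n; v++)
                    if (!visited[v]) { visited[v] = 1; perm[tail++] = v; break; }
        }
        for (int i = 0, j = n - 1; i < j; i++, j--) {  /* reverse = RCM */
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        free(visited);
    }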

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
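For flavor, a minimal sketch of OSKI's hint-tune-multiply pattern, modeled on the example in the OSKI user's guide (error checking omitted, index types abbreviated to int; consult the guide for the authoritative signatures):

    #include <oski/oski.h>

    /* Sketch: tune a CSR matrix for repeated SpMV, then apply it.
     * OSKI weighs the hinted workload against the cost of converting
     * the data structure (e.g., to BCSR) before deciding to tune. */
    void spmv_tuned(int *Aptr, int *Aind, double *Aval, int m, int n,
                    double *x, double *y, int num_calls)
    {
        oski_Init();
        oski_matrix_t A = oski_CreateMatCSR(Aptr, Aind, Aval, m, n,
                                            SHARE_INPUTMAT, 1, INT_NULL_LIST);
        oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
        oski_vecview_t yv = oski_CreateVecView(y, m, STRIDE_UNIT);

        /* hint the expected workload, then let OSKI tune (or not) */
        oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, num_calls);
        oski_TuneMat(A);

        for (int i = 0; i < num_calls; i++)
            oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);  /* y = A*x */

        oski_DestroyVecView(xv); oski_DestroyVecView(yv);
        oski_DestroyMat(A);
        oski_Close();
    }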

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods - Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Slide: CG pseudocode] SpMVs and dot products require communication in each iteration.

94

Example: CA-Conjugate Gradient

[Slide: CA-CG pseudocode] The s SpMVs are replaced by one call to the CA Matrix Powers Kernel, and the dot products by one global reduction to compute G. Local computations within the inner loop require no communication.
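In outline, the standard s-step reorganization works as follows (a sketch in the usual CA-CG notation, not the slide's exact pseudocode). The matrix powers kernel computes a Krylov basis

    V = [ p, A·p, A^2·p, …, A^s·p ],

reading A and exchanging ghost data once instead of s times. A single global reduction then forms the Gram matrix

    G = V^T·V,

whose entries supply every inner product needed for the next s CG iterations, so the iterations' scalar and vector updates proceed with local work only. The price is extra flops and memory for V, plus the numerical fragility shown two slides below.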

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods - Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure, slides 96-97: convergence of CA-CG (monomial basis) vs. CG on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. Roundoff causes slower convergence and loss of accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods - Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

  Nonzero entries \ Indices | Explicit (O(nnz))  | Implicit (o(nnz))
  Explicit (O(nnz))         | CSR and variations | Vision, climate, AMR, …
  Implicit (o(nnz))         | Graph Laplacian    | Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
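Why this form is closed under the powers the matrix powers kernel needs (a one-step sketch; here U, V are n-by-r with r small, D is r-by-r):

    (S + U·D·V^T)^2 = S^2 + (S·U)·D·V^T + U·D·(S^T·V)^T + U·(D·V^T·U·D)·V^T,

so the square is S^2 plus a correction of rank at most 3r: each correction term keeps the tall-skinny-times-small-times-tall-skinny shape, with the new tall-skinny factors S·U and S^T·V computable by SpMV. Inductively, A^k = S^k + (low rank), so the sparse part and the low-rank part can be propagated separately.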

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods - Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

[Figure, two panels: "Absolute Error for Random Vectors" (results of the same magnitude but opposite signs) and "Relative Error for Orthogonal Vectors" (even the sign is not reproducible)]

103
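The root cause is that floating-point addition is not associative, so a different thread count changes the reduction order and hence the bits of the result. A self-contained illustration (not the MKL experiment itself):

    #include <stdio.h>

    /* Summing the same three doubles in two orders gives two different
     * IEEE 754 results: (a+b)+c != a+(b+c). */
    int main(void)
    {
        double a = 1e16, b = -1e16, c = 1.0;
        double left  = (a + b) + c;   /* = 1.0: a+b cancels exactly */
        double right = a + (b + c);   /* = 0.0: c is lost in rounding b+c */
        printf("left  = %.17g\nright = %.17g\n", left, right);
        return 0;
    }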

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

104
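A minimal sketch in C of the prerounding idea (simplified to a single bin; the real Nguyen-Demmel algorithm uses several bins to preserve accuracy, and all names here are illustrative): round every summand to a common bit boundary derived from the global maximum, so the subsequent additions are exact and therefore independent of summation order.

    #include <math.h>
    #include <stdlib.h>

    /* Reproducible sum, 1-bin prerounding sketch:
     * 1) find max |x_i| (max is associative, so this reduction is
     *    already reproducible);
     * 2) pick a boundary M = 2^k so that (x_i + M) - M rounds each x_i
     *    to a fixed absolute grid of spacing ulp(M);
     * 3) the pre-rounded values then sum exactly, in any order. */
    double reproducible_sum(const double *x, size_t n)
    {
        double mx = 0.0;
        for (size_t i = 0; i < n; i++)
            if (fabs(x[i]) > mx) mx = fabs(x[i]);
        if (mx == 0.0) return 0.0;

        int e; frexp(mx, &e);            /* mx is in [2^(e-1), 2^e) */
        double M = ldexp(1.0, e + 40);   /* grid spacing 2^(e-12); partial
                                            sums stay exact for n < 2^41 */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            volatile double t = x[i] + M;  /* rounds x[i] onto the grid */
            s += t - M;                    /* both on the grid: exact */
        }
        return s;  /* same bits for any ordering of the grid values */
    }

The 40-bit offset trades accuracy against the largest n summed exactly; adjusting it (or adding bins) is how goal 4, user-selectable accuracy, is met.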

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Don't Communic…

106


                                                                                                                                                            Register Profile Itanium 2

                                                                                                                                                            190 Mflops

                                                                                                                                                            1190 Mflops

                                                                                                                                                            80

                                                                                                                                                            Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

                                                                                                                                                            Itanium 2 - 33Itanium 1 - 8

                                                                                                                                                            252 Mflops

                                                                                                                                                            122 Mflops

                                                                                                                                                            820 Mflops

                                                                                                                                                            459 Mflops

                                                                                                                                                            247 Mflops

                                                                                                                                                            107 Mflops

                                                                                                                                                            12 Gflops

                                                                                                                                                            190 Mflops

                                                                                                                                                            Another example of tuning challenges for SpMV

                                                                                                                                                            bull Ex11 matrix (fluid flow)

                                                                                                                                                            bull More complicated non-zero structure in general

                                                                                                                                                            bull N = 16614bull NNZ = 11M

                                                                                                                                                            82

                                                                                                                                                            Zoom in to top corner

                                                                                                                                                            bull More complicated non-zero structure in general

                                                                                                                                                            bull N = 16614bull NNZ = 11M

                                                                                                                                                            83

                                                                                                                                                            3x3 blocks look natural buthellip

                                                                                                                                                            bull Example 3x3 blockingndash Logical grid of 3x3 cells

                                                                                                                                                            bull But would lead to lots of ldquofill-inrdquo

                                                                                                                                                            84

                                                                                                                                                            Extra Work Can Improve Efficiency

• Example: 3x3 blocking
– Logical grid of 3x3 cells
– Fill in explicit zeros
– Unroll 3x3 block multiplies
– "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
– Actual Mflop rate 1.5^2 = 2.25x higher
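To make the fill-in trade-off concrete, here is a minimal sketch of a 3x3 block-CSR (BCSR) SpMV inner loop; the array names (brow_ptr, bcol_idx, bval) and the fixed 3x3 block size are illustrative assumptions, not the tuned BeBOP kernel:

    /* y += A*x for a matrix stored as dense 3x3 blocks (BCSR).
       Some stored entries are explicit zeros ("fill-in"). */
    void bcsr_spmv_3x3(int n_brows, const int *brow_ptr, const int *bcol_idx,
                       const double *bval, const double *x, double *y)
    {
        for (int bi = 0; bi < n_brows; bi++) {
            double y0 = 0.0, y1 = 0.0, y2 = 0.0;       /* 3 output rows in registers */
            for (int k = brow_ptr[bi]; k < brow_ptr[bi+1]; k++) {
                const double *b  = &bval[9*k];          /* one dense 3x3 block */
                const double *xp = &x[3*bcol_idx[k]];   /* matching x sub-vector */
                /* fully unrolled 3x3 block multiply */
                y0 += b[0]*xp[0] + b[1]*xp[1] + b[2]*xp[2];
                y1 += b[3]*xp[0] + b[4]*xp[1] + b[5]*xp[2];
                y2 += b[6]*xp[0] + b[7]*xp[1] + b[8]*xp[2];
            }
            y[3*bi+0] += y0;  y[3*bi+1] += y1;  y[3*bi+2] += y2;
        }
    }

The padded zeros cost extra flops (the fill ratio of 1.5), but one column index now serves nine values and the x entries are reused from registers, which is where the net speedup comes from.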

                                                                                                                                                            85

Source: Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                                                            86

                                                                                                                                                            100x100 Submatrix Along Diagonal

87

                                                                                                                                                            Post-RCM Reordering

                                                                                                                                                            88

                                                                                                                                                            Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

                                                                                                                                                            Summary of Other Performance Optimizations

• Optimizations for SpMV
– Register blocking (RB): up to 4x over CSR
– Reordering to create dense structure: 2x over CSR
– Variable block splitting: 2.1x over CSR, 1.8x over RB
– Diagonals: 2x over CSR
– Symmetry: 2.8x over CSR, 2.6x over RB
– Cache blocking: 2.8x over CSR
– Multiple vectors (SpMM): 7x over CSR
– And combinations…

• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
– A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
– More general kernels later…
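The win for these kernels comes from reading A only once. A hedged illustration for y = A^T·(A·x) on a plain CSR matrix (simplified, with illustrative names, not the tuned OSKI kernel): each row a_i is used for both the dot product t = a_i·x and the update y += t·a_i while it is still in cache:

    /* Fused y = A^T * (A * x) over CSR rows: each sparse row a_i is read
       once, serving both t = a_i . x and the rank-1 update y += t * a_i. */
    void ata_x(int n_rows, int n_cols, const int *row_ptr, const int *col_idx,
               const double *val, const double *x, double *y)
    {
        for (int j = 0; j < n_cols; j++) y[j] = 0.0;
        for (int i = 0; i < n_rows; i++) {
            double t = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)   /* t = a_i . x */
                t += val[k] * x[col_idx[k]];
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)   /* y += t * a_i */
                y[col_idx[k]] += t * val[k];
        }
    }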

                                                                                                                                                            90

                                                                                                                                                            Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
– BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
– Does both off-line and run-time tuning
– Hides complexity of run-time tuning

• For "advanced" users & solver library writers
– Available as stand-alone library
– Available as PETSc extension
– bebop.cs.berkeley.edu/oski

• pOSKI
– Extension to multicore architectures
– OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
– bebop.cs.berkeley.edu/poski
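The intended calling sequence, sketched after the examples in the OSKI user's guide (the function and constant names are OSKI's; the CSR arrays, the vectors, and the hint of ~500 calls are illustrative assumptions, and oski_Init() is assumed to have been called once at startup):

    #include <oski/oski.h>

    /* Sketch: tune and apply SpMV with OSKI for a CSR matrix. */
    void tuned_spmv(int *Aptr, int *Aind, double *Aval, int num_rows,
                    int num_cols, double *x, double *y)
    {
        oski_matrix_t  A  = oski_CreateMatCSR(Aptr, Aind, Aval, num_rows, num_cols,
                                              SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
        oski_vecview_t xv = oski_CreateVecView(x, num_cols, STRIDE_UNIT);
        oski_vecview_t yv = oski_CreateVecView(y, num_rows, STRIDE_UNIT);

        /* Hint: we expect ~500 SpMVs with these operands, so tuning can pay off. */
        oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, 500);
        oski_TuneMat(A);                               /* run-time tuning happens here */

        oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);  /* y = A*x, using tuned kernel */

        oski_DestroyVecView(xv);
        oski_DestroyVecView(yv);
        oski_DestroyMat(A);
    }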

                                                                                                                                                            91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism / nonassociativity

                                                                                                                                                            93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

                                                                                                                                                            via CA Matrix Powers Kernel

                                                                                                                                                            Global reduction to compute G

                                                                                                                                                            94
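For reference, a serial sketch of the classical iteration with the communication points marked in comments (a simplified CSR version with illustrative names, not a production code; in a distributed run the SpMV needs a halo exchange with neighbors and each dot product a global all-reduce):

    #include <math.h>

    /* Classical CG (serial sketch). Per iteration: 1 SpMV + 2 dot products,
       i.e., neighbor communication + 2 global reductions in parallel. */
    static double dot(const double *u, const double *v, int n) {
        double s = 0.0;                       /* in parallel: MPI_Allreduce here */
        for (int i = 0; i < n; i++) s += u[i] * v[i];
        return s;
    }

    void cg(int n, const int *ptr, const int *idx, const double *val,
            const double *b, double *x, int max_iter, double tol)
    {
        double r[n], p[n], w[n];              /* C99 VLAs, for brevity */
        for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
        double rr = dot(r, r, n);
        for (int k = 0; k < max_iter && sqrt(rr) > tol; k++) {
            for (int i = 0; i < n; i++) {     /* w = A*p: halo exchange */
                double s = 0.0;
                for (int j = ptr[i]; j < ptr[i+1]; j++) s += val[j] * p[idx[j]];
                w[i] = s;
            }
            double alpha = rr / dot(p, w, n); /* global reduction #1 */
            for (int i = 0; i < n; i++) { x[i] += alpha*p[i]; r[i] -= alpha*w[i]; }
            double rr_new = dot(r, r, n);     /* global reduction #2 */
            double beta = rr_new / rr;
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
    }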

Example: CA-Conjugate Gradient

Local computations within inner loop require no communication
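Schematically, CA-CG replaces those per-iteration reductions with one basis computation and one Gram-matrix reduction per s steps. A simplified runnable sketch of just that communication skeleton (illustrative names; the real algorithm also carries the r- and x-bases and the coefficient recurrences for the inner steps):

    /* Communication skeleton of CA-CG: per outer step, build the s-step
       Krylov basis V = [p, A*p, ..., A^s * p] (done by the CA matrix powers
       kernel with one round of neighbor communication) and its Gram matrix
       G = V^T V (one global reduction). The s inner iterations then update
       length-(s+1) coefficient vectors using only G: no communication. */
    void basis_and_gram(int n, const int *ptr, const int *idx, const double *val,
                        const double *p, int s,
                        double *V /* n x (s+1), column-major */,
                        double *G /* (s+1) x (s+1) */)
    {
        for (int i = 0; i < n; i++) V[i] = p[i];       /* column 0 = p */
        for (int c = 1; c <= s; c++)                   /* column c = A * column c-1 */
            for (int i = 0; i < n; i++) {
                double t = 0.0;
                for (int j = ptr[i]; j < ptr[i+1]; j++)
                    t += val[j] * V[(c-1)*n + idx[j]];
                V[c*n + i] = t;
            }
        for (int a = 0; a <= s; a++)                   /* G = V^T V: one reduction */
            for (int b = 0; b <= s; b++) {
                double t = 0.0;
                for (int i = 0; i < n; i++) t += V[a*n + i] * V[b*n + i];
                G[a*(s+1) + b] = t;
            }
    }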

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism / nonassociativity

                                                                                                                                                            96

Slower convergence due to roundoff

                                                                                                                                                            Loss of accuracy due to roundoff

At s = 16, the monomial basis is rank deficient: the method breaks down
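The breakdown is visible in the basis itself: the monomial matrix powers kernel computes $V = [p,\ Ap,\ A^2p,\ \ldots,\ A^sp]$, and since $A^kp$ turns toward the dominant eigenvector as $k$ grows, the columns become nearly parallel and $\mathrm{cond}(V)$ grows exponentially in $s$; by $s = 16$ the computed basis is numerically rank deficient. The standard remedy in the CA-Krylov literature (noted here as background, not a claim from this slide) is a better-conditioned polynomial basis, e.g. Newton,

$V = [\,p,\ (A-\theta_1 I)p,\ (A-\theta_2 I)(A-\theta_1 I)p,\ \ldots\,]$

with shifts $\theta_k$ taken from eigenvalue estimates, or Chebyshev polynomials on an interval containing the spectrum.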

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ~ 400

[Plot: convergence of CA-CG (monomial) vs. CG, down to machine precision]

                                                                                                                                                            97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism / nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be sum of "sparse" matrices
– Ex: A = sparse + low rank = S + UDV^T, D small & square

• Semiseparable matrices arise as preconditioners
– Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices
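Why the low-rank part stays low rank, worked for $k = 2$ (the general $k$ follows by induction):

$(S + UDV^T)^2 = S^2 + S\,UDV^T + UDV^T S + U\,(DV^T U D)\,V^T$

Each term after $S^2$ keeps a factor of $U$ on the left or $V^T$ on the right, so its rank is at most the number of columns of $U$; collecting them gives $A^2 = S^2 + U'D'V'^T$ for some modest-width $U', V'$, and repeating the argument yields $A^k = S^k + (\text{low rank})$, exactly the splitting the matrix powers kernel needs.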

Nonzero entries \ Indices    Explicit (O(nnz))       Implicit (o(nnz))
Explicit (O(nnz))            CSR and variations      Vision, climate, AMR, …
Implicit (o(nnz))            Graph Laplacian         Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
  • classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism / nonassociativity

                                                                                                                                                            101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010

– From Kai Diethelm, at GNS-MBH
– Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results

– Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
– Most: "What? How will I debug without reproducibility?"
– Few: "I know better and do careful error analysis"
– S. Govindjee: needs it for fracture simulations
– S. Russell: needs it for nuclear blast detection

                                                                                                                                                            Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Plots: "Absolute Error for Random Vectors" (same magnitude, opposite signs) and "Relative Error for Orthogonal Vectors" (sign not reproducible)]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
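What MKL is exposing is ordinary floating-point nonassociativity: change the reduction order (as a different thread count does) and the rounded result changes. A tiny self-contained C illustration, unrelated to MKL itself:

    #include <stdio.h>

    /* Summing the same data in two orders gives two different answers:
       (big + tiny) rounds tiny away, but accumulating tinies first keeps them. */
    int main(void) {
        double big = 1e16, tiny = 1.0;

        double left_to_right = big;
        for (int i = 0; i < 1000; i++) left_to_right += tiny;  /* each add rounds tiny away */

        double tinies = 0.0;
        for (int i = 0; i < 1000; i++) tinies += tiny;
        double grouped = big + tinies;                         /* tinies survive */

        printf("%.1f vs %.1f\n", left_to_right, grouped);      /* differ by 1000 */
        return 0;
    }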

                                                                                                                                                            103

• Consider summation or dot product
• Goals:
1. Same answer, independent of layout, # processors, order of summands
2. Good performance (scales well)
3. Portable (assume IEEE 754 only)
4. User can choose accuracy
• Approaches:
– Guarantee fixed reduction tree (not 2 or 3)
– Use (very) high precision to get exact answer (not 2)
– Prerounding technique (Nguyen, D.)

Goals/Approaches for Reproducibility
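The prerounding idea can be sketched in a few lines. This is a deliberately simplified single-level version under stated assumptions (one max-reduction, a common power-of-two grain so every term becomes an integer multiple of it, and n small enough, here n < 2^13, that the integer sum stays exact below 2^53); the actual Nguyen-Demmel algorithm is more refined and handles the general case:

    #include <math.h>

    /* Simplified prerounded sum: deterministic for any summation order.
       Each x[i] is rounded to a multiple of a common grain 'delta', so
       every addition is exact integer arithmetic in double, hence associative. */
    double prerounded_sum(const double *x, int n) {
        double m = 0.0;
        for (int i = 0; i < n; i++)            /* max-reduction: order-independent */
            if (fabs(x[i]) > m) m = fabs(x[i]);
        if (m == 0.0) return 0.0;

        int e;
        frexp(m, &e);                          /* m lies in [2^(e-1), 2^e) */
        double delta = ldexp(1.0, e - 40);     /* grain: keep ~40 bits per term */

        double s = 0.0;                        /* |s| <= n * 2^40: exact if n < 2^13 */
        for (int i = 0; i < n; i++)
            s += nearbyint(x[i] / delta);      /* prerounded term is an integer */
        return s * delta;
    }

Because the prerounded terms are integers and each partial sum stays below 2^53, every addition is exact, so the result is independent of summation order; the accuracy lost is at most delta/2 per term, and the user can trade accuracy for range by moving the grain.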

                                                                                                                                                            104

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                                                                                                                            Summary

Don't Communic…

                                                                                                                                                            106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                            • Implementing Communication-Avoiding Algorithms
                                                                                                                                                            • Why avoid communication
                                                                                                                                                            • Goals
                                                                                                                                                            • Outline
                                                                                                                                                            • Outline (2)
                                                                                                                                                            • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                            • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                            • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                            • Limits to parallel scaling (12)
                                                                                                                                                            • Limits to parallel scaling (22)
                                                                                                                                                            • Can we attain these lower bounds
                                                                                                                                                            • Outline (3)
                                                                                                                                                            • 25D Matrix Multiplication
                                                                                                                                                            • 25D Matrix Multiplication (2)
                                                                                                                                                            • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                            • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                            • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                            • Handling Heterogeneity
                                                                                                                                                            • Application to Tensor Contractions
                                                                                                                                                            • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                            • Application to Tensor Contractions (2)
                                                                                                                                                            • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                            • vs
                                                                                                                                                            • Slide 26
                                                                                                                                                            • Strassen-like beyond matmul
                                                                                                                                                            • Cache and Network Oblivious Algorithms
                                                                                                                                                            • CARMA Performance Distributed Memory
                                                                                                                                                            • CARMA Performance Distributed Memory (2)
                                                                                                                                                            • CARMA Performance Shared Memory
                                                                                                                                                            • CARMA Performance Shared Memory (2)
                                                                                                                                                            • Why is CARMA Faster in Shared Memory
                                                                                                                                                            • Outline (4)
                                                                                                                                                            • One-sided Factorizations (LU QR) so far
                                                                                                                                                            • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                            • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                            • Minimizing Communication in TSLU
                                                                                                                                                            • Making TSLU Numerically Stable
                                                                                                                                                            • Stability of LU using TSLU CALU
                                                                                                                                                            • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                            • Fixing TSLU
                                                                                                                                                            • 2D CALU with Tournament Pivoting
                                                                                                                                                            • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                            • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                            • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                            • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                            • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                            • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                            • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                            • Outline (5)
                                                                                                                                                            • What about sparse matrices (13)
                                                                                                                                                            • Performance of 25D APSP using Kleene
                                                                                                                                                            • What about sparse matrices (23)
                                                                                                                                                            • What about sparse matrices (33)
                                                                                                                                                            • Outline (6)
                                                                                                                                                            • Symmetric Eigenproblem and SVD
                                                                                                                                                            • Slide 58
                                                                                                                                                            • Slide 59
                                                                                                                                                            • Slide 60
                                                                                                                                                            • Slide 61
                                                                                                                                                            • Slide 62
                                                                                                                                                            • Slide 63
                                                                                                                                                            • Slide 64
                                                                                                                                                            • Slide 65
                                                                                                                                                            • Slide 66
                                                                                                                                                            • Slide 67
                                                                                                                                                            • Slide 68
                                                                                                                                                            • Conventional vs CA - SBR
                                                                                                                                                            • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                            • Nonsymmetric Eigenproblem
                                                                                                                                                            • Attaining the Lower bounds Sequential
                                                                                                                                                            • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                            • Outline (7)
                                                                                                                                                            • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                            • Outline (8)
                                                                                                                                                            • Example The Difficulty of Tuning SpMV
                                                                                                                                                            • Example The Difficulty of Tuning
                                                                                                                                                            • Speedups on Itanium 2 The Need for Search
                                                                                                                                                            • Register Profile Itanium 2
                                                                                                                                                            • Register Profiles IBM and Intel IA-64
                                                                                                                                                            • Another example of tuning challenges for SpMV
                                                                                                                                                            • Zoom in to top corner
                                                                                                                                                            • 3x3 blocks look natural buthellip
                                                                                                                                                            • Extra Work Can Improve Efficiency
                                                                                                                                                            • Slide 86
                                                                                                                                                            • Slide 87
                                                                                                                                                            • Slide 88
                                                                                                                                                            • Slide 89
                                                                                                                                                            • Summary of Other Performance Optimizations
                                                                                                                                                            • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                            • Outline (9)
                                                                                                                                                            • Example Classical Conjugate Gradient (CG)
                                                                                                                                                            • Example CA-Conjugate Gradient
                                                                                                                                                            • Outline (10)
                                                                                                                                                            • Slide 96
                                                                                                                                                            • Slide 97
                                                                                                                                                            • Outline (11)
                                                                                                                                                            • What is a ldquosparse matrixrdquo
                                                                                                                                                            • Outline (12)
                                                                                                                                                            • Reproducible Floating Point Computation
                                                                                                                                                            • Intel MKL non-reproducibility
                                                                                                                                                            • GoalsApproaches for Reproducibility
                                                                                                                                                            • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                            • Collaborators and Supporters
                                                                                                                                                            • Summary

                                                                                                                                                              Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

                                                                                                                                                              Itanium 2 - 33Itanium 1 - 8

                                                                                                                                                              252 Mflops

                                                                                                                                                              122 Mflops

                                                                                                                                                              820 Mflops

                                                                                                                                                              459 Mflops

                                                                                                                                                              247 Mflops

                                                                                                                                                              107 Mflops

                                                                                                                                                              12 Gflops

                                                                                                                                                              190 Mflops

                                                                                                                                                              Another example of tuning challenges for SpMV

                                                                                                                                                              bull Ex11 matrix (fluid flow)

                                                                                                                                                              bull More complicated non-zero structure in general

                                                                                                                                                              bull N = 16614bull NNZ = 11M

                                                                                                                                                              82

                                                                                                                                                              Zoom in to top corner

                                                                                                                                                              bull More complicated non-zero structure in general

                                                                                                                                                              bull N = 16614bull NNZ = 11M

                                                                                                                                                              83

                                                                                                                                                              3x3 blocks look natural buthellip

                                                                                                                                                              bull Example 3x3 blockingndash Logical grid of 3x3 cells

                                                                                                                                                              bull But would lead to lots of ldquofill-inrdquo

                                                                                                                                                              84

                                                                                                                                                              Extra Work Can Improve Efficiency

                                                                                                                                                              bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

                                                                                                                                                              bull On Pentium III 15x speedupndash Actual mflop rate 152 = 225 higher

                                                                                                                                                              85

                                                                                                                                                              Source Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                                                              86

                                                                                                                                                              100x100 Submatrix Along Diagonal

                                                                                                                                                              Summer School Lecture 787

                                                                                                                                                              Post-RCM Reordering

                                                                                                                                                              88

                                                                                                                                                              Effect of Combined RCM+TSP Reordering

                                                                                                                                                              Before Green + RedAfter Green + Blue

                                                                                                                                                              Summer School Lecture 789

                                                                                                                                                              2x speedups on Pentium 4 Power 4 hellip

                                                                                                                                                              Summary of Other Performance Optimizations

                                                                                                                                                              bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

                                                                                                                                                              bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

                                                                                                                                                              bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

                                                                                                                                                              90

                                                                                                                                                              Optimized Sparse Kernel Interface - OSKI

                                                                                                                                                              bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

                                                                                                                                                              bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

                                                                                                                                                              bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

                                                                                                                                                              software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

                                                                                                                                                              91

                                                                                                                                                              Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                              ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                              ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                              bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                              bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                              93

                                                                                                                                                              Example Classical Conjugate Gradient (CG)

                                                                                                                                                              SpMVs and dot products require communication in

                                                                                                                                                              each iteration

                                                                                                                                                              via CA Matrix Powers Kernel

                                                                                                                                                              Global reduction to compute G

                                                                                                                                                              94

                                                                                                                                                              Example CA-Conjugate Gradient

                                                                                                                                                              Local computations within inner loop require

                                                                                                                                                              no communication

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

CA-CG (monomial) vs. CG: roundoff effects

Model problem:
• 2D Poisson, 5-point stencil
• 30×30 grid
• cond(A) ≈ 400

[Convergence plot, residual vs. iteration, down to machine precision: CA-CG with the monomial basis shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
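The breakdown is easy to reproduce: without orthogonalization, the monomial basis vectors [x, Ax, A²x, …] align with the dominant eigenvectors, so the basis condition number grows exponentially in s. A small demo, assuming a 1D Poisson matrix as a stand-in for the slide's 2D model problem (the exact s at which cond reaches 1/ε depends on the spectrum):

```python
import numpy as np

# 1D Poisson (tridiagonal) as a stand-in for the 2D model problem
n = 100
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

rng = np.random.default_rng(0)
s_max = 16
V = np.empty((n, s_max + 1))
V[:, 0] = rng.random(n)
V[:, 0] /= np.linalg.norm(V[:, 0])
for j in range(s_max):
    w = A @ V[:, j]
    V[:, j + 1] = w / np.linalg.norm(w)   # scaling alone cannot fix conditioning

for s in (4, 8, 12, 16):
    print(f"s = {s:2d}: cond([x, Ax, ..., A^s x]) = "
          f"{np.linalg.cond(V[:, :s+1]):.1e}")
# cond grows exponentially with s; once it nears 1/eps (~1e16) the basis
# is numerically rank deficient and the method breaks down
```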

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices (see the sketch below)

                            | Indices: explicit (O(nnz)) | Indices: implicit (o(nnz))
Nonzero entries: explicit   | CSR and variations         | Vision, climate, AMR, …
Nonzero entries: implicit   | Graph Laplacian            | Stencils
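The sum structure can be exploited without ever forming A. A sketch (sizes, density, and rank are made-up) of applying A = S + U D Vᵀ, and hence Aᵏ, to a vector at O(nnz(S) + nr) cost per application:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical sizes; the point is the cost model, not the values.
n, r = 10_000, 4
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))
V = rng.standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x without forming A: O(nnz(S) + n*r) work."""
    return S @ x + U @ (D @ (V.T @ x))

def apply_A_power(x, k):
    """A^k x by k applications; never forms the dense (S + U D V^T)^k."""
    for _ in range(k):
        x = apply_A(x)
    return x

y = apply_A_power(rng.standard_normal(n), 3)
```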

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

[Plots: absolute error for random vectors (results of the same magnitude but opposite signs occur) and relative error for orthogonal vectors (even the sign is not reproducible).]
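The underlying cause is that floating-point addition is not associative, so a different thread count implies a different reduction order and, typically, different result bits. A tiny single-threaded illustration, with summation order standing in for thread count:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

# Three orderings of the same mathematical dot product:
d1 = float(np.dot(x, y))                  # library's blocked order
d2 = float(np.dot(x[::-1], y[::-1]))      # reversed traversal
d3 = float(sum(float(a) * float(b) for a, b in zip(x, y)))  # left-to-right
print(d1 == d2 == d3)      # typically False
print(d1 - d2, d1 - d3)    # differences in the last bits
```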

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (sacrifices goal 2 or 3)
  – Use (very) high precision to get the exact answer (sacrifices goal 2)
  – Prerounding technique (Nguyen, D.), sketched below
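A simplified, single-level sketch of the prerounding idea (the published Nguyen-Demmel algorithm uses several levels and keeps the rounded-off parts to retain accuracy): rounding every summand onto a common power-of-two grid makes every subsequent addition exact, so the result is bitwise identical for any summation order or processor count.

```python
import numpy as np

def reproducible_sum(x):
    """One-level prerounding sketch: order-independent summation."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    m = float(np.max(np.abs(x))) if n else 0.0
    if m == 0.0:
        return 0.0
    # Power-of-two boundary B >= 2*n*m: every partial sum of the
    # prerounded values is then an exactly representable multiple
    # of ulp(B)/2, so each IEEE addition is exact.
    B = 2.0 ** (np.ceil(np.log2(n)) + np.ceil(np.log2(m)) + 1)
    hi = (x + B) - B          # round each x_i onto the ulp(B)/2 grid
    return float(np.sum(hi))  # all additions exact => order-independent bits

xs = np.random.default_rng(0).standard_normal(10**6)
assert reproducible_sum(xs) == reproducible_sum(xs[::-1])  # same bits
```

The accuracy lost to the prerounding step is what the multi-level version of the algorithm recovers, at a user-selectable level.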

Performance results on 1024 proc. Cray XC30: 12x to 32x slowdown vs. fastest code, for n = 1M.

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).




                                                                                                                                                                bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

                                                                                                                                                                Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

                                                                                                                                                                Instruments NEC Nokia NVIDIA Samsung Oracle

                                                                                                                                                                bull bebopcsberkeleyedu

                                                                                                                                                                Summary

                                                                                                                                                                Donrsquot Communichellip

                                                                                                                                                                106

                                                                                                                                                                Time to redesign all linear algebra n-body hellip algorithms and software

                                                                                                                                                                (and compilers)

                                                                                                                                                                • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                • Why avoid communication
                                                                                                                                                                • Goals
                                                                                                                                                                • Outline
                                                                                                                                                                • Outline (2)
                                                                                                                                                                • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                                • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                                • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                                • Limits to parallel scaling (12)
                                                                                                                                                                • Limits to parallel scaling (22)
                                                                                                                                                                • Can we attain these lower bounds
                                                                                                                                                                • Outline (3)
                                                                                                                                                                • 25D Matrix Multiplication
                                                                                                                                                                • 25D Matrix Multiplication (2)
                                                                                                                                                                • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                • Handling Heterogeneity
                                                                                                                                                                • Application to Tensor Contractions
                                                                                                                                                                • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                • Application to Tensor Contractions (2)
                                                                                                                                                                • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                • vs
                                                                                                                                                                • Slide 26
                                                                                                                                                                • Strassen-like beyond matmul
                                                                                                                                                                • Cache and Network Oblivious Algorithms
                                                                                                                                                                • CARMA Performance Distributed Memory
                                                                                                                                                                • CARMA Performance Distributed Memory (2)
                                                                                                                                                                • CARMA Performance Shared Memory
                                                                                                                                                                • CARMA Performance Shared Memory (2)
                                                                                                                                                                • Why is CARMA Faster in Shared Memory
                                                                                                                                                                • Outline (4)
                                                                                                                                                                • One-sided Factorizations (LU QR) so far
                                                                                                                                                                • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                • Minimizing Communication in TSLU
                                                                                                                                                                • Making TSLU Numerically Stable
                                                                                                                                                                • Stability of LU using TSLU CALU
                                                                                                                                                                • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                • Fixing TSLU
                                                                                                                                                                • 2D CALU with Tournament Pivoting
                                                                                                                                                                • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                • Outline (5)
                                                                                                                                                                • What about sparse matrices (13)
                                                                                                                                                                • Performance of 25D APSP using Kleene
                                                                                                                                                                • What about sparse matrices (23)
                                                                                                                                                                • What about sparse matrices (33)
                                                                                                                                                                • Outline (6)
                                                                                                                                                                • Symmetric Eigenproblem and SVD
                                                                                                                                                                • Slide 58
                                                                                                                                                                • Slide 59
                                                                                                                                                                • Slide 60
                                                                                                                                                                • Slide 61
                                                                                                                                                                • Slide 62
                                                                                                                                                                • Slide 63
                                                                                                                                                                • Slide 64
                                                                                                                                                                • Slide 65
                                                                                                                                                                • Slide 66
                                                                                                                                                                • Slide 67
                                                                                                                                                                • Slide 68
                                                                                                                                                                • Conventional vs CA - SBR
                                                                                                                                                                • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                • Nonsymmetric Eigenproblem
                                                                                                                                                                • Attaining the Lower bounds Sequential
                                                                                                                                                                • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                • Outline (7)
                                                                                                                                                                • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                • Outline (8)
                                                                                                                                                                • Example The Difficulty of Tuning SpMV
                                                                                                                                                                • Example The Difficulty of Tuning
                                                                                                                                                                • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                • Register Profile Itanium 2
                                                                                                                                                                • Register Profiles IBM and Intel IA-64
                                                                                                                                                                • Another example of tuning challenges for SpMV
                                                                                                                                                                • Zoom in to top corner
                                                                                                                                                                • 3x3 blocks look natural buthellip
                                                                                                                                                                • Extra Work Can Improve Efficiency
                                                                                                                                                                • Slide 86
                                                                                                                                                                • Slide 87
                                                                                                                                                                • Slide 88
                                                                                                                                                                • Slide 89
                                                                                                                                                                • Summary of Other Performance Optimizations
                                                                                                                                                                • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                • Outline (9)
                                                                                                                                                                • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                • Example CA-Conjugate Gradient
                                                                                                                                                                • Outline (10)
                                                                                                                                                                • Slide 96
                                                                                                                                                                • Slide 97
                                                                                                                                                                • Outline (11)
                                                                                                                                                                • What is a ldquosparse matrixrdquo
                                                                                                                                                                • Outline (12)
                                                                                                                                                                • Reproducible Floating Point Computation
                                                                                                                                                                • Intel MKL non-reproducibility
                                                                                                                                                                • GoalsApproaches for Reproducibility
                                                                                                                                                                • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                • Collaborators and Supporters
                                                                                                                                                                • Summary

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
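To make the trade-off concrete, here is a minimal 3x3 block-CSR (BCSR) SpMV sketch in C (my illustration, not code from the talk): zeros are stored explicitly so that every block multiply is the same fully unrolled dense 3x3 kernel, which is what buys the speedup despite the extra flops. The tiny test matrix and array names are illustrative assumptions.

#include <stdio.h>

/* y += A*x where A is stored in 3x3 Block CSR (BCSR):
   - brow_ptr[i]..brow_ptr[i+1] indexes the block-columns of block-row i
   - each block is 9 contiguous doubles (row-major), with zeros filled
     in explicitly so the inner kernel is always a dense, unrolled 3x3 */
static void spmv_bcsr3x3(int n_brows, const int *brow_ptr,
                         const int *bcol_idx, const double *bval,
                         const double *x, double *y)
{
    for (int ib = 0; ib < n_brows; ib++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;   /* registers for 3 rows */
        for (int k = brow_ptr[ib]; k < brow_ptr[ib+1]; k++) {
            const double *b  = &bval[9*k];
            const double *xx = &x[3*bcol_idx[k]];
            /* fully unrolled 3x3 block multiply */
            y0 += b[0]*xx[0] + b[1]*xx[1] + b[2]*xx[2];
            y1 += b[3]*xx[0] + b[4]*xx[1] + b[5]*xx[2];
            y2 += b[6]*xx[0] + b[7]*xx[1] + b[8]*xx[2];
        }
        y[3*ib+0] += y0; y[3*ib+1] += y1; y[3*ib+2] += y2;
    }
}

int main(void)
{
    /* 6x6 matrix = 2x2 grid of 3x3 blocks; only 2 blocks are nonzero */
    int brow_ptr[] = {0, 1, 2};
    int bcol_idx[] = {0, 1};
    double bval[18] = {
        2,0,0, 0,2,0, 0,0,2,   /* block (0,0): note the explicit zeros */
        1,0,0, 0,1,0, 0,0,1    /* block (1,1) */
    };
    double x[6] = {1,2,3,4,5,6}, y[6] = {0};
    spmv_bcsr3x3(2, brow_ptr, bcol_idx, bval, x, y);
    for (int i = 0; i < 6; i++) printf("%g ", y[i]);
    printf("\n");   /* expect: 2 4 6 4 5 6 */
    return 0;
}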

Source: Accelerator Cavity Design Problem (Ko via Husbands)


                                                                                                                                                                  100x100 Submatrix Along Diagonal


                                                                                                                                                                  Post-RCM Reordering


Effect of Combined RCM+TSP Reordering

• Before: Green + Red
• After: Green + Blue


• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
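For flavor, a sketch of an OSKI calling sequence in C, patterned on the library's published examples; treat the exact constant names and signatures as assumptions to be checked against the OSKI user guide, and the hints (3x3 blocks, ~500 calls) as illustrative:

#include <oski/oski.h>

/* Sketch: tune SpMV for an n x n CSR matrix (ptr, ind, val).
   Verify exact names/flags against the OSKI user guide for your
   installed version. */
void tuned_spmv(int n, int *ptr, int *ind, double *val,
                double *x, double *y, double alpha, double beta)
{
    oski_Init();

    oski_matrix_t  A  = oski_CreateMatCSR(ptr, ind, val, n, n,
                                          SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, n, STRIDE_UNIT);

    /* Hints: this SpMV will be called ~500 times, and the matrix has
       (assumed) 3x3 block structure -- both guide the run-time tuner. */
    oski_SetHintMatMult(A, OP_NORMAL, alpha, xv, beta, yv, 500);
    oski_SetHint(A, HINT_SINGLE_BLOCKSIZE, 3, 3);
    oski_TuneMat(A);                         /* pay the tuning cost once */

    for (int i = 0; i < 500; i++)            /* y = alpha*A*x + beta*y */
        oski_MatMult(A, OP_NORMAL, alpha, xv, beta, yv);

    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_DestroyMat(A);
    oski_Close();
}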

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm listing not preserved in this transcript.] The SpMVs and dot products require communication in each iteration.
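As a concrete reference point, here is a minimal sequential CG in C on the talk's model problem (2D Poisson, 5-point stencil, 30x30 grid), with comments marking where each operation would communicate in a distributed-memory run; this is my illustrative sketch, not code from the talk.

#include <stdio.h>
#include <math.h>

#define NG 30                /* 30x30 grid, as in the model problem */
#define N  (NG*NG)

/* y = A*x for the 2D Poisson 5-point stencil (matrix-free SpMV).
   In parallel, this is where neighbor (halo) communication happens. */
static void spmv_poisson(const double *x, double *y)
{
    for (int i = 0; i < NG; i++)
        for (int j = 0; j < NG; j++) {
            int k = i*NG + j;
            double s = 4.0 * x[k];
            if (i > 0)    s -= x[k-NG];
            if (i < NG-1) s -= x[k+NG];
            if (j > 0)    s -= x[k-1];
            if (j < NG-1) s -= x[k+1];
            y[k] = s;
        }
}

/* Dot product: in parallel, a global reduction (e.g. MPI_Allreduce). */
static double dot(const double *u, const double *v)
{
    double s = 0.0;
    for (int k = 0; k < N; k++) s += u[k]*v[k];
    return s;
}

int main(void)
{
    static double x[N], r[N], p[N], Ap[N], b[N];
    for (int k = 0; k < N; k++) { b[k] = 1.0; x[k] = 0.0; r[k] = b[k]; p[k] = r[k]; }

    double rr = dot(r, r);                   /* communication: reduction */
    for (int it = 0; it < 200 && sqrt(rr) > 1e-10; it++) {
        spmv_poisson(p, Ap);                 /* communication: halo exchange */
        double alpha = rr / dot(p, Ap);      /* communication: reduction */
        for (int k = 0; k < N; k++) { x[k] += alpha*p[k]; r[k] -= alpha*Ap[k]; }
        double rr_new = dot(r, r);           /* communication: reduction */
        double beta = rr_new / rr;
        rr = rr_new;
        for (int k = 0; k < N; k++) p[k] = r[k] + beta*p[k];
        printf("iter %3d  ||r|| = %.3e\n", it, sqrt(rr));
    }
    return 0;
}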

Example: CA-Conjugate Gradient

[Algorithm listing not preserved in this transcript.] The SpMVs are performed via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.
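The matrix powers kernel is easiest to see in 1D: to take s stencil steps with no further communication, fetch s ghost cells per side up front and let the region of valid entries shrink by one cell per step. A toy sequential sketch (my illustration; sizes and values are arbitrary):

#include <stdio.h>

#define S      4              /* powers computed per "communication"  */
#define LOCAL  8              /* locally owned entries                */
#define W      (LOCAL + 2*S)  /* local chunk + S ghost cells per side */

/* One step of a 1D 3-point stencil: y[i] = -x[i-1] + 2x[i] - x[i+1].
   Each application shrinks the valid region by one cell on each side,
   so S ghost cells fetched ONCE suffice for S stencil steps. */
static void step(const double *x, double *y, int lo, int hi)
{
    for (int i = lo; i <= hi; i++)
        y[i] = -x[i-1] + 2.0*x[i] - x[i+1];
}

int main(void)
{
    double v[S+1][W];
    /* v[0] holds the local values PLUS S ghost cells on each side,
       obtained with a single (wider-than-usual) neighbor exchange. */
    for (int i = 0; i < W; i++) v[0][i] = (double)i;

    for (int j = 1; j <= S; j++)         /* no further communication  */
        step(v[j-1], v[j], j, W-1-j);    /* valid region shrinks 1/side */

    /* v[j][S..S+LOCAL-1] now holds (A^j x) on the locally owned part */
    for (int j = 0; j <= S; j++) {
        printf("A^%d x:", j);
        for (int i = S; i < S+LOCAL; i++) printf(" %7.1f", v[j][i]);
        printf("\n");
    }
    return 0;
}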

Outline (repeated; next: Stability challenges and approaches)

[Convergence plots: CA-CG (monomial basis) vs. CG on the model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down, while CG converges to machine precision.]

Outline (repeated; next: What is a "sparse matrix"?)

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices

Examples, by how entries and indices are represented:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz))   CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz))   Graph Laplacian             Stencils
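To see why the sparse-plus-low-rank form matters computationally, a small sketch (mine, with illustrative sizes and values): y = (S + UDVᵀ)x can be applied in O(nnz(S) + n·r) work by applying the factors in sequence instead of forming the dense sum.

#include <stdio.h>

#define N 4   /* matrix dimension (illustrative) */
#define R 1   /* rank of the low-rank part       */

/* y = (S + U*D*V^T) x without forming the dense sum:
   y = S x, then t = V^T x, t = D t, y += U t.
   Cost: O(nnz(S) + N*R) instead of O(N^2). */
int main(void)
{
    double Sdiag[N] = {2, 2, 2, 2};          /* S: sparse (here diagonal) */
    double U[N][R]  = {{1}, {1}, {1}, {1}};  /* low-rank factors          */
    double V[N][R]  = {{1}, {0}, {0}, {1}};
    double D[R][R]  = {{3}};                 /* small & square (diagonal) */
    double x[N] = {1, 2, 3, 4}, y[N], t[R];

    for (int i = 0; i < N; i++) y[i] = Sdiag[i] * x[i];   /* y = S x   */

    for (int k = 0; k < R; k++) {                         /* t = V^T x */
        t[k] = 0.0;
        for (int i = 0; i < N; i++) t[k] += V[i][k] * x[i];
    }
    for (int k = 0; k < R; k++) t[k] *= D[k][k];          /* t = D t   */
    for (int i = 0; i < N; i++)                           /* y += U t  */
        for (int k = 0; k < R; k++) y[i] += U[i][k] * t[k];

    for (int i = 0; i < N; i++) printf("%g ", y[i]);
    printf("\n");   /* expect: 17 19 21 23 */
    return 0;
}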

Outline (repeated; next: Floating-point reproducibility)

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible).]
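The root cause is that floating-point addition is not associative, so a different thread count means a different summation order and, in general, a different rounded result. A minimal illustration (my example, not MKL code):

#include <stdio.h>

/* Floating-point addition is not associative: summing the same data
   in a different order (as different thread counts do) can change the
   rounded result.  Here the same four numbers give two different sums. */
int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0, d = 1.0;

    double left_to_right = ((a + b) + c) + d;  /* (0 + 1) + 1 = 2        */
    double regrouped     = (a + (b + c)) + d;  /* b+c rounds back to b,
                                                  so this is 0 + 1 = 1   */

    printf("((a+b)+c)+d = %g\n", left_to_right);  /* prints 2 */
    printf("(a+(b+c))+d = %g\n", regrouped);      /* prints 1 */
    return 0;
}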

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get the exact answer (fails 2)
  – Prerounding technique (Nguyen, D.)
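To show the flavor of prerounding, here is a deliberately simplified one-bin sketch of its core trick (my reconstruction, not the actual Demmel–Nguyen algorithm, which uses several bins and careful scaling): every summand is rounded onto a common coarse grid derived from a global maximum, after which the additions commit no rounding error, so any summation order gives bit-identical results.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Simplified one-bin "prerounding" sum.  Pick a power of two
   M >= 2*n*max|x_i|.  Then t = fl(x_i + M) rounds x_i onto a grid of
   spacing ulp(M)/2, and t - M is exact (Sterbenz).  All summands now
   lie on one grid and partial sums stay well below M, so every
   addition is exact: any summation order gives identical bits.
   Accuracy is reduced (|error| <= n*ulp(M)/2) but is selectable via
   the grid; the real algorithm uses multiple bins to control this. */
static double repro_sum(const double *x, int n)
{
    double maxabs = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > maxabs) maxabs = fabs(x[i]);
    if (maxabs == 0.0) return 0.0;

    double M = ldexp(1.0, (int)ceil(log2((double)n * maxabs)) + 1);

    double s = 0.0;
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + M;   /* volatile: keep the rounding */
        s += t - M;                     /* both operations are exact   */
    }
    return s;
}

int main(void)
{
    int n = 1000000;
    double *x = malloc(n * sizeof *x);
    for (int i = 0; i < n; i++) x[i] = sin((double)i) * 1e-3;

    double fwd = repro_sum(x, n);
    for (int i = 0; i < n/2; i++) {     /* reverse the summation order */
        double t = x[i]; x[i] = x[n-1-i]; x[n-1-i] = t;
    }
    double rev = repro_sum(x, n);

    printf("forward  = %.17e\nreversed = %.17e\nbitwise identical: %s\n",
           fwd, rev, (fwd == rev) ? "yes" : "no");
    free(x);
    return 0;
}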

Performance results on a 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                                  • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                  • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                  • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                  • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                  • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                  • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                  • Outline (5)
                                                                                                                                                                  • What about sparse matrices (13)
                                                                                                                                                                  • Performance of 25D APSP using Kleene
                                                                                                                                                                  • What about sparse matrices (23)
                                                                                                                                                                  • What about sparse matrices (33)
                                                                                                                                                                  • Outline (6)
                                                                                                                                                                  • Symmetric Eigenproblem and SVD
                                                                                                                                                                  • Slide 58
                                                                                                                                                                  • Slide 59
                                                                                                                                                                  • Slide 60
                                                                                                                                                                  • Slide 61
                                                                                                                                                                  • Slide 62
                                                                                                                                                                  • Slide 63
                                                                                                                                                                  • Slide 64
                                                                                                                                                                  • Slide 65
                                                                                                                                                                  • Slide 66
                                                                                                                                                                  • Slide 67
                                                                                                                                                                  • Slide 68
                                                                                                                                                                  • Conventional vs CA - SBR
                                                                                                                                                                  • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                  • Nonsymmetric Eigenproblem
                                                                                                                                                                  • Attaining the Lower bounds Sequential
                                                                                                                                                                  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                  • Outline (7)
                                                                                                                                                                  • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                  • Outline (8)
                                                                                                                                                                  • Example The Difficulty of Tuning SpMV
                                                                                                                                                                  • Example The Difficulty of Tuning
                                                                                                                                                                  • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                  • Register Profile Itanium 2
                                                                                                                                                                  • Register Profiles IBM and Intel IA-64
                                                                                                                                                                  • Another example of tuning challenges for SpMV
                                                                                                                                                                  • Zoom in to top corner
                                                                                                                                                                  • 3x3 blocks look natural buthellip
                                                                                                                                                                  • Extra Work Can Improve Efficiency
                                                                                                                                                                  • Slide 86
                                                                                                                                                                  • Slide 87
                                                                                                                                                                  • Slide 88
                                                                                                                                                                  • Slide 89
                                                                                                                                                                  • Summary of Other Performance Optimizations
                                                                                                                                                                  • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                  • Outline (9)
                                                                                                                                                                  • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                  • Example CA-Conjugate Gradient
                                                                                                                                                                  • Outline (10)
                                                                                                                                                                  • Slide 96
                                                                                                                                                                  • Slide 97
                                                                                                                                                                  • Outline (11)
                                                                                                                                                                  • What is a ldquosparse matrixrdquo
                                                                                                                                                                  • Outline (12)
                                                                                                                                                                  • Reproducible Floating Point Computation
                                                                                                                                                                  • Intel MKL non-reproducibility
                                                                                                                                                                  • GoalsApproaches for Reproducibility
                                                                                                                                                                  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                  • Collaborators and Supporters
                                                                                                                                                                  • Summary

3x3 blocks look natural but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

                                                                                                                                                                    84

                                                                                                                                                                    Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher

                                                                                                                                                                    85
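As an illustration (SciPy's BSR format, not the tuned kernels discussed in the talk), the sketch below performs exactly this register blocking: explicit zeros are stored so that each 3x3 block multiply can be unrolled. The toy matrix and its fill ratio are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 6x6 matrix whose nonzeros almost -- but not quite -- fill two 3x3 blocks.
A = csr_matrix(np.array([
    [1., 2., 0., 0., 0., 0.],
    [0., 3., 4., 0., 0., 0.],
    [5., 0., 6., 0., 0., 0.],
    [0., 0., 0., 7., 0., 8.],
    [0., 0., 0., 0., 9., 0.],
    [0., 0., 0., 1., 0., 2.],
]))

B = A.tobsr(blocksize=(3, 3))  # register blocking: fills in explicit zeros
fill_ratio = B.nnz / A.nnz     # stored entries (incl. zeros) / true nonzeros
print(fill_ratio)              # 18 / 11 here; the slide's example is 1.5

x = np.ones(6)
assert np.allclose(A @ x, B @ x)  # same result, block-unrolled kernel
```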

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                                                                                                                    86

                                                                                                                                                                    100x100 Submatrix Along Diagonal

87

                                                                                                                                                                    Post-RCM Reordering

                                                                                                                                                                    88

                                                                                                                                                                    Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.

89

2x speedups on Pentium 4, Power 4, …

                                                                                                                                                                    Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                                                                                                    90

                                                                                                                                                                    Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                                                                    91
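The run-time tuning idea can be sketched in a few lines: benchmark the kernel under several candidate block sizes on the user's actual matrix and keep the fastest. This toy loop is only in the spirit of OSKI; it is not OSKI's C API, and the matrix and block-size list are illustrative.

```python
import time
import numpy as np
from scipy.sparse import random as sprandom

def tune_spmv(A_csr, blocksizes=((1, 1), (2, 2), (3, 3), (4, 4), (6, 6))):
    """Pick the register-block size that runs fastest on this matrix/machine."""
    x = np.ones(A_csr.shape[1])
    best, best_time = A_csr, float("inf")
    for bs in blocksizes:
        try:
            Ab = A_csr.tobsr(blocksize=bs)
        except ValueError:          # block size must divide the dimensions
            continue
        t0 = time.perf_counter()
        for _ in range(20):         # benchmark SpMV with this blocking
            Ab @ x
        t = time.perf_counter() - t0
        if t < best_time:
            best, best_time = Ab, t
    return best

A = sprandom(1200, 1200, density=0.01, format="csr", random_state=0)
A_tuned = tune_spmv(A)              # use A_tuned @ x from here on
```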

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                    93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

                                                                                                                                                                    94
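A plain numpy sketch of classical CG, with the per-iteration communication points marked; in the distributed setting each SpMV is a neighbor exchange and each dot product a global reduction:

```python
import numpy as np

def cg(A, b, x, tol=1e-8, maxiter=500):
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV: neighbor communication
        alpha = rr / (p @ Ap)      # dot product: global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r             # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```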

Example: CA-Conjugate Gradient

The s SpMVs are done via the CA matrix powers kernel, and a single global reduction computes the Gram matrix G; local computations within the inner loop then require no communication.
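A shared-memory sketch of the two building blocks named above; the distributed version computes all s products after a single ghost-zone exchange and forms G with one reduction (shapes and names here are illustrative):

```python
import numpy as np

def matrix_powers(A, p, s):
    """CA matrix powers kernel (sketch): rows are [p, Ap, ..., A^s p].
    Distributed, one ghost-zone exchange up front lets each processor
    compute all s products with no further communication."""
    V = np.empty((s + 1, p.size))
    V[0] = p
    for j in range(s):
        V[j + 1] = A @ V[j]
    return V

def gram(V):
    """One global reduction per s steps: G = V V^T supplies every inner
    product the next s CG iterations need."""
    return V @ V.T
```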

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                    96

[Figure: convergence of CG vs. CA-CG (monomial basis). Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG converges more slowly and loses accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down, while CG converges to machine precision.]

                                                                                                                                                                    97
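The breakdown is easy to reproduce: the monomial basis vectors [p, Ap, A²p, …] all converge toward the dominant eigenvector, so the basis's condition number explodes with s. A small experiment, using a 1D Poisson matrix as a stand-in for the 2D model problem:

```python
import numpy as np
from scipy.sparse import diags

# 1D Poisson stand-in for the model problem; any SPD stencil shows the effect.
n = 900
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()
p = np.random.default_rng(0).standard_normal(n)

for s in (4, 8, 16):
    V = np.empty((n, s + 1))
    V[:, 0] = p / np.linalg.norm(p)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)  # scale, but don't orthogonalize
    sv = np.linalg.svd(V, compute_uv=False)
    print(s, sv[0] / sv[-1])                 # basis condition number blows up
```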

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                               Indices
                               Explicit (O(nnz))      Implicit (o(nnz))
  Nonzero    Explicit (O(nnz)) CSR and variations     Vision, climate, AMR, …
  entries    Implicit (o(nnz)) Graph Laplacian        Stencils
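For example, an operator of the form A = S + U·D·Vᵀ can be applied in O(nnz(S) + n·r) work without ever forming the dense n x n matrix; a sketch with made-up sizes:

```python
import numpy as np
from scipy.sparse import random as sprandom

n, r = 1000, 5
S = sprandom(n, n, density=1e-3, format="csr", random_state=0)  # sparse part
U = np.random.default_rng(1).standard_normal((n, r))
D = np.diag(np.arange(1.0, r + 1))                              # small & square
V = np.random.default_rng(2).standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x in O(nnz(S) + n*r) work, never forming A."""
    return S @ x + U @ (D @ (V.T @ x))

y = apply_A(np.ones(n))
```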

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                    101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% – 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Sign not reproducible.]

                                                                                                                                                                    103
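The root cause is that floating-point addition is not associative, so any change in how a threaded reduction groups its partial sums can change the result:

```python
# Floating-point addition is not associative, so the answer of a threaded
# dot product depends on how the partial sums happen to be grouped:
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0  -- the 1.0 is absorbed into -1e16 first
```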

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

                                                                                                                                                                    104
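A one-bin sketch of the pre-rounding idea (the full algorithm uses a few bins to recover accuracy; bin-width choice below is one reasonable assumption): truncate every summand to a multiple of a common power-of-two bin width, so that every addition is exact and therefore order-independent.

```python
import math, random

def reproducible_sum(x):
    """One-bin sketch of pre-rounding: every addition below is exact,
    so the result is bit-wise identical for any summation order.
    (The real algorithm uses several bins to recover full accuracy.)"""
    n = len(x)
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    _, e = math.frexp(m)                          # 2**(e-1) <= m < 2**e
    B = math.ldexp(1.0, e - 52 + n.bit_length())  # common bin width, power of 2
    total = 0.0
    for v in x:
        total += math.floor(v / B) * B            # multiples of B: exact sums
    return total

vals = [random.uniform(-1, 1) for _ in range(10**5)]
s1 = reproducible_sum(vals)
random.shuffle(vals)
s2 = reproducible_sum(vals)
assert s1 == s2          # identical despite a different summation order
```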

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                                                                                                                                    Summary

Don't Communic…

                                                                                                                                                                    106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                                    • Reproducible Floating Point Computation
                                                                                                                                                                    • Intel MKL non-reproducibility
                                                                                                                                                                    • GoalsApproaches for Reproducibility
                                                                                                                                                                    • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                    • Collaborators and Supporters
                                                                                                                                                                    • Summary

                                                                                                                                                                      Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher

                                                                                                                                                                      85
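To make the blocking trade-off concrete, here is a minimal sketch of a 3x3 block-CSR (BCSR) SpMV, the data structure register blocking produces. This is illustrative Python, not OSKI's API; the array names (brow, bcol, bval) are assumptions. The fill ratio from the slide is (stored entries, including explicit zeros) / (true nonzeros).

```python
import numpy as np

def bcsr_spmv_3x3(brow, bcol, bval, x):
    """y = A @ x for A in 3x3 block-CSR: brow = block-row pointers,
    bcol = block-column indices, bval[k] = dense 3x3 block k
    (explicit zeros included -- that is the "fill")."""
    R = 3
    y = np.zeros((len(brow) - 1) * R)
    for bi in range(len(brow) - 1):              # each block row
        for k in range(brow[bi], brow[bi + 1]):  # blocks in this row
            j = bcol[k] * R
            # In a tuned C kernel this 3x3 multiply is fully unrolled,
            # keeping its 9 multiply-adds in registers.
            y[bi * R:(bi + 1) * R] += bval[k] @ x[j:j + R]
    return y
```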

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

                                                                                                                                                                      86

                                                                                                                                                                      100x100 Submatrix Along Diagonal

87

                                                                                                                                                                      Post-RCM Reordering

                                                                                                                                                                      88

                                                                                                                                                                      Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

                                                                                                                                                                      Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                                                                                                      90

                                                                                                                                                                      Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV (triangular solve)
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                                                                      91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                      93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
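As a reference point, here is a minimal unpreconditioned CG in Python/NumPy — a sketch, not the slide's exact listing. Each iteration performs one SpMV and two dot products; in a parallel setting the SpMV needs neighbor communication and each dot product a global reduction.

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    """Classical conjugate gradient for SPD A (unpreconditioned sketch)."""
    x = x0.copy()
    r = b - A @ x                 # initial residual
    p = r.copy()                  # initial search direction
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p                # SpMV: neighbor communication in parallel
        alpha = rr / (p @ Ap)     # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r            # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```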

In CA-CG, the SpMVs are instead computed via the CA matrix powers kernel, and the dot products are replaced by one global reduction that computes a Gram matrix G.

                                                                                                                                                                      94

Example: CA-Conjugate Gradient

Local computations within the inner loop require no communication.
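A serial sketch of the two ingredients named above. The function below only mimics what the matrix powers kernel must produce; the communication avoidance lies in how a distributed implementation obtains all s products after a single round of ghost-zone exchange, which is not shown here.

```python
import numpy as np

def matrix_powers(A, p, s):
    """Values produced by the CA matrix powers kernel:
    V = [p, A p, A^2 p, ..., A^s p] as columns. A CA implementation
    computes them all after ONE round of neighbor communication,
    instead of one exchange per SpMV."""
    V = [p]
    for _ in range(s):
        V.append(A @ V[-1])
    return np.column_stack(V)

# Gram matrix: all the dot products CA-CG needs for s inner steps,
# obtained with a single global reduction instead of one per step.
# G = V.T @ V
```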

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                      96

[Figure: convergence of CG vs CA-CG with the monomial basis on a model problem — 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG (monomial) shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. CG itself converges to machine precision.]

                                                                                                                                                                      97
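For readers who want to reproduce the model problem, a short sketch (SciPy assumed) that builds the 5-point Poisson matrix on the 30x30 grid and checks the condition number quoted on the slide:

```python
import numpy as np
import scipy.sparse as sp

# Model problem from the slide: 2D Poisson, 5-point stencil, 30x30 grid.
n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kron(I, T) + sp.kron(T, I)     # 900x900 5-point Laplacian

# Condition number check: should be ~400, matching cond(A) on the slide.
ev = np.linalg.eigvalsh(A.toarray())
print(ev[-1] / ev[0])                  # ≈ 4e2
```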

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                            Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
Nonzero entries: Explicit   CSR and variations           Vision, climate, AMR, …
Nonzero entries: Implicit   Graph Laplacian              Stencils
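As a sketch of that last requirement, A = S + UDVᵀ can be applied to a vector — and hence a Krylov basis for powers of A built — without ever forming the dense sum. The code below is illustrative (names and shapes are assumptions), not a semiseparable-preconditioner implementation.

```python
import numpy as np

def apply_A(S, U, D, V, x):
    """(S + U @ D @ V.T) @ x without forming the dense sum:
    the low-rank part costs only O(n * rank) work."""
    return S @ x + U @ (D @ (V.T @ x))

def krylov_basis(S, U, D, V, x, k):
    """[x, A x, A^2 x, ..., A^k x] for A = S + U D V^T."""
    basis = [x]
    for _ in range(k):
        basis.append(apply_A(S, U, D, V, basis[-1]))
    return np.column_stack(basis)
```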

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                      101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

                                                                                                                                                                      Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: two panels — "Absolute Error for Random Vectors" (errors of the same magnitude but opposite signs: even the sign is not reproducible) and "Relative Error for Orthogonal Vectors". Vector size 1e6; data aligned to 16-byte boundaries. For each input vector:
• dot products are computed using 1, 2, 3, or 4 threads
• absolute error = maximum – minimum
• relative error = absolute error / maximum absolute value]

                                                                                                                                                                      103
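The root cause is that floating-point addition is not associative, so a different thread count means a different summation order. A two-line illustration in IEEE 754 double precision (not MKL-specific):

```python
# Floating-point addition is not associative:
x = [1.0, 1e100, 1.0, -1e100]
print(sum(x))        # 0.0  (summed left to right)
print(sum(x[::-1]))  # 1.0  (same numbers, different order)
```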

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (violates goals 2 and 3)
  – Use (very) high precision to get exact answer (violates goal 2)
  – Prerounding technique (Nguyen, D.)

                                                                                                                                                                      104
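A minimal one-"fold" sketch of the prerounding idea; the actual Nguyen/Demmel algorithm uses several folds and more careful scaling, so the constants below are illustrative only.

```python
import math

def prerounded_sum(x):
    """Reproducible summation sketch: pre-round all summands onto a
    common power-of-two grid so the subsequent sum is EXACT in
    IEEE 754 arithmetic -- hence identical for any summation order."""
    n = len(x)
    M = max(abs(v) for v in x)
    if M == 0.0:
        return 0.0
    # Power-of-two boundary S with S > 2*n*M, so no partial sum of the
    # pre-rounded values can reach S.
    S = math.ldexp(1.0, math.frexp(M)[1] + n.bit_length() + 1)
    q = [(S + v) - S for v in x]   # snap each v onto the grid ulp(S)
    # All q are multiples of a common power of two and every partial sum
    # stays below S, so sum(q) commits no rounding error. Accuracy is
    # ~n*ulp(S); the real algorithm recovers more accuracy by repeating
    # the process ("folding") on the residues v - q.
    return sum(q)
```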

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                                                                      Summary

Don't Communic…

                                                                                                                                                                      106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                                      • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                      • Why avoid communication
                                                                                                                                                                      • Goals
                                                                                                                                                                      • Outline
                                                                                                                                                                      • Outline (2)
                                                                                                                                                                      • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                                      • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                                      • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                                      • Limits to parallel scaling (12)
                                                                                                                                                                      • Limits to parallel scaling (22)
                                                                                                                                                                      • Can we attain these lower bounds
                                                                                                                                                                      • Outline (3)
                                                                                                                                                                      • 25D Matrix Multiplication
                                                                                                                                                                      • 25D Matrix Multiplication (2)
                                                                                                                                                                      • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                      • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                      • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                      • Handling Heterogeneity
                                                                                                                                                                      • Application to Tensor Contractions
                                                                                                                                                                      • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                      • Application to Tensor Contractions (2)
                                                                                                                                                                      • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                      • vs
                                                                                                                                                                      • Slide 26
                                                                                                                                                                      • Strassen-like beyond matmul
                                                                                                                                                                      • Cache and Network Oblivious Algorithms
                                                                                                                                                                      • CARMA Performance Distributed Memory
                                                                                                                                                                      • CARMA Performance Distributed Memory (2)
                                                                                                                                                                      • CARMA Performance Shared Memory
                                                                                                                                                                      • CARMA Performance Shared Memory (2)
                                                                                                                                                                      • Why is CARMA Faster in Shared Memory
                                                                                                                                                                      • Outline (4)
                                                                                                                                                                      • One-sided Factorizations (LU QR) so far
                                                                                                                                                                      • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                      • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                      • Minimizing Communication in TSLU
                                                                                                                                                                      • Making TSLU Numerically Stable
                                                                                                                                                                      • Stability of LU using TSLU CALU
                                                                                                                                                                      • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                      • Fixing TSLU
                                                                                                                                                                      • 2D CALU with Tournament Pivoting
                                                                                                                                                                      • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                      • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                      • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                      • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                      • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                      • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                      • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                      • Outline (5)
                                                                                                                                                                      • What about sparse matrices (13)
                                                                                                                                                                      • Performance of 25D APSP using Kleene
                                                                                                                                                                      • What about sparse matrices (23)
                                                                                                                                                                      • What about sparse matrices (33)
                                                                                                                                                                      • Outline (6)
                                                                                                                                                                      • Symmetric Eigenproblem and SVD
                                                                                                                                                                      • Slide 58
                                                                                                                                                                      • Slide 59
                                                                                                                                                                      • Slide 60
                                                                                                                                                                      • Slide 61
                                                                                                                                                                      • Slide 62
                                                                                                                                                                      • Slide 63
                                                                                                                                                                      • Slide 64
                                                                                                                                                                      • Slide 65
                                                                                                                                                                      • Slide 66
                                                                                                                                                                      • Slide 67
                                                                                                                                                                      • Slide 68
                                                                                                                                                                      • Conventional vs CA - SBR
                                                                                                                                                                      • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                      • Nonsymmetric Eigenproblem
                                                                                                                                                                      • Attaining the Lower bounds Sequential
                                                                                                                                                                      • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                      • Outline (7)
                                                                                                                                                                      • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                      • Outline (8)
                                                                                                                                                                      • Example The Difficulty of Tuning SpMV
                                                                                                                                                                      • Example The Difficulty of Tuning
                                                                                                                                                                      • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                      • Register Profile Itanium 2
                                                                                                                                                                      • Register Profiles IBM and Intel IA-64
                                                                                                                                                                      • Another example of tuning challenges for SpMV
                                                                                                                                                                      • Zoom in to top corner
                                                                                                                                                                      • 3x3 blocks look natural buthellip
                                                                                                                                                                      • Extra Work Can Improve Efficiency
                                                                                                                                                                      • Slide 86
                                                                                                                                                                      • Slide 87
                                                                                                                                                                      • Slide 88
                                                                                                                                                                      • Slide 89
                                                                                                                                                                      • Summary of Other Performance Optimizations
                                                                                                                                                                      • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                      • Outline (9)
                                                                                                                                                                      • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                      • Example CA-Conjugate Gradient
                                                                                                                                                                      • Outline (10)
                                                                                                                                                                      • Slide 96
                                                                                                                                                                      • Slide 97
                                                                                                                                                                      • Outline (11)
                                                                                                                                                                      • What is a ldquosparse matrixrdquo
                                                                                                                                                                      • Outline (12)
                                                                                                                                                                      • Reproducible Floating Point Computation
                                                                                                                                                                      • Intel MKL non-reproducibility
                                                                                                                                                                      • GoalsApproaches for Reproducibility
                                                                                                                                                                      • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                      • Collaborators and Supporters
                                                                                                                                                                      • Summary

                                                                                                                                                                        Source Accelerator Cavity Design Problem (Ko via Husbands)

                                                                                                                                                                        86

                                                                                                                                                                        100x100 Submatrix Along Diagonal

                                                                                                                                                                        Summer School Lecture 787

                                                                                                                                                                        Post-RCM Reordering

                                                                                                                                                                        88

                                                                                                                                                                        Effect of Combined RCM+TSP Reordering

                                                                                                                                                                        Before Green + RedAfter Green + Blue

                                                                                                                                                                        Summer School Lecture 789

                                                                                                                                                                        2x speedups on Pentium 4 Power 4 hellip

                                                                                                                                                                        Summary of Other Performance Optimizations

                                                                                                                                                                        bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

                                                                                                                                                                        bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

                                                                                                                                                                        bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

                                                                                                                                                                        90

                                                                                                                                                                        Optimized Sparse Kernel Interface - OSKI

                                                                                                                                                                        bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

                                                                                                                                                                        bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

                                                                                                                                                                        bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

                                                                                                                                                                        software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

                                                                                                                                                                        91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                        93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
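As a reference point, here is a minimal single-node classical CG (a dense numpy sketch, not the parallel code). In a distributed run, each A @ p is a neighbor communication and each dot product is a global reduction, so every iteration pays latency.

```python
import numpy as np

def cg(A, b, x, tol=1e-12, maxit=1000):
    x = np.asarray(x, dtype=float).copy()
    r = b - A @ x                 # SpMV: neighbor communication per iteration
    p = r.copy()
    rr = r @ r                    # dot product: global reduction per iteration
    for _ in range(maxit):
        Ap = A @ p                # SpMV
        alpha = rr / (p @ Ap)     # dot product
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r            # dot product
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```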

                                                                                                                                                                        94

Example: CA-Conjugate Gradient

Figure callouts: the s SpMVs per outer iteration are computed via the CA Matrix Powers Kernel, and one global reduction computes G; local computations within the inner loop require no communication.
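Below is a serial numpy sketch of the s-step structure in the spirit of Carson and Demmel's CA-CG; the monomial basis and the coefficient bookkeeping are my reconstruction, kept simple rather than faithful. Per outer iteration: one matrix-powers computation builds 2s+1 basis vectors and one Gram matrix G = VᵀV is formed; the s inner steps then update short coefficient vectors using only G and a small basis matrix T, with no further SpMVs or reductions.

```python
import numpy as np

def ca_cg(A, b, x, s=4, outer=50):
    """s-step CA-CG sketch, monomial basis. One basis build plus one
    Gram-matrix reduction replaces the s SpMVs and 2s dot products of
    s classical CG iterations."""
    x = np.asarray(x, dtype=float).copy()
    n = len(b)
    r = b - A @ x
    p = r.copy()
    for _ in range(outer):
        # "Matrix powers kernel" (here just repeated SpMV):
        # V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]
        V = np.empty((n, 2 * s + 1))
        V[:, 0] = p
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        V[:, s + 1] = r
        for j in range(s - 1):
            V[:, s + 2 + j] = A @ V[:, s + 1 + j]
        G = V.T @ V                     # the single global reduction
        # T maps coefficients c of a vector V @ c to coefficients of A @ (V @ c)
        T = np.zeros((2 * s + 1, 2 * s + 1))
        for j in range(s):
            T[j + 1, j] = 1.0
        for j in range(s - 1):
            T[s + 2 + j, s + 1 + j] = 1.0
        # Length-(2s+1) coefficient vectors for p, r, and the update to x
        pc = np.zeros(2 * s + 1); pc[0] = 1.0
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0
        xc = np.zeros(2 * s + 1)
        for _ in range(s):              # inner loop: local work only
            Tpc = T @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Tpc)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Tpc
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x += V @ xc                     # map short vectors back to length n
        r = V @ rc
        p = V @ pc
    return x
```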

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                        96

[Convergence plot: CA-CG (monomial basis) vs CG, residual norm per iteration, machine precision marked]
• Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400
• CA-CG (monomial) shows slower convergence than CG due to roundoff, and loss of accuracy due to roundoff
• At s = 16 the monomial basis is rank deficient; the method breaks down
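The s = 16 breakdown is easy to reproduce: the monomial basis vectors become numerically linearly dependent as s grows. A small check on the same model problem (a sketch, not the original experiment; sizes per the slide):

```python
import numpy as np
import scipy.sparse as sp

def poisson2d(m):
    """5-point 2D Poisson stencil on an m x m grid."""
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    I = sp.identity(m)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)
rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])
v /= np.linalg.norm(v)
for s in (4, 8, 16):
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        w = A @ V[:, j]                      # monomial basis: v, Av, A^2 v, ...
        V[:, j + 1] = w / np.linalg.norm(w)  # normalize; ill-conditioning remains
    # cond(V) grows rapidly with s; once it nears 1/eps the basis is
    # numerically rank deficient and CA-CG breaks down
    print(s, np.linalg.cond(V))
```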

                                                                                                                                                                        97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Storage of a sparse matrix, by how nonzero entries and indices are represented:

                                     Indices
                                     Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero    Explicit (O(nnz))       CSR and variations    Vision, climate, AMR, …
  entries    Implicit (o(nnz))       Graph Laplacian       Stencils
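One way to read the "implicit" cases: keep A in factored or rule-based form and only ever apply it. For the sparse-plus-low-rank example above, a minimal matrix-free sketch (the class and names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

class SparsePlusLowRank:
    """A = S + U @ D @ V.T kept in factored form: o(n^2) storage,
    O(nnz(S) + n*rank) work per matvec."""
    def __init__(self, S, U, D, V):
        self.S, self.U, self.D, self.V = S, U, D, V

    def matvec(self, x):
        return self.S @ x + self.U @ (self.D @ (self.V.T @ x))

    def power_matvec(self, x, k):
        # A^k x by k repeated matvecs; never forms (S + U D V^T)^k densely
        for _ in range(k):
            x = self.matvec(x)
        return x

# toy usage: sparse part plus a rank-3 correction
n = 1000
rng = np.random.default_rng(0)
A = SparsePlusLowRank(sp.random(n, n, density=1e-3, format="csr", random_state=0),
                      rng.standard_normal((n, 3)), np.eye(3),
                      rng.standard_normal((n, 3)))
y = A.power_matvec(np.ones(n), k=4)
```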

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                        101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% to 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Two plots: "Absolute Error for Random Vectors" (errors of the same magnitude, opposite signs) and "Relative Error for Orthogonal Vectors" (sign not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
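The mechanism is ordinary floating-point non-associativity: different thread counts group the partial sums differently. A toy numpy demo of the same measurement (it mimics thread partitioning rather than calling MKL, so the exact error values will differ from the slide's plots):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)
prods = x * y

results = []
for nthreads in (1, 2, 3, 4):
    # each "thread" reduces a contiguous chunk; partial sums are then combined
    chunks = np.array_split(prods, nthreads)
    results.append(sum(float(np.sum(c)) for c in chunks))

print(max(results) - min(results))   # absolute error: typically nonzero
```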

                                                                                                                                                                        103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (sacrifices goal 2 or 3)
  – Use (very) high precision to get exact answer (sacrifices goal 2)
  – Prerounding technique (Nguyen, D.) (see the sketch below)
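Here is a toy sketch of the prerounding idea, in the spirit of the Nguyen and Demmel technique; the single-bin simplification, the constants, and the absence of overflow handling are my assumptions. Every summand is rounded onto a common grid determined by n and max|xᵢ|, after which the floating-point sum is exact and therefore bit-identical in any order.

```python
import numpy as np

def reproducible_sum(x):
    """Order-independent sum via prerounding (toy: one 'bin', no overflow handling)."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Power-of-2 boundary B >= 2*n*m: (xi + B) - B rounds each xi to a
    # multiple of a common grid spacing, and all partial sums then fit exactly.
    B = 2.0 ** np.ceil(np.log2(2.0 * n * m))
    t = (x + B) - B          # prerounded summands: exact multiples of the grid
    return float(np.sum(t))  # exact sum, so any order gives identical bits

# identical bits under any permutation, at the cost of a small bounded error
rng = np.random.default_rng(0)
v = rng.standard_normal(10**6)
assert reproducible_sum(v) == reproducible_sum(v[::-1])
```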

                                                                                                                                                                        104

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Don't Communic…

106



                                                                                                                                                                          Summary of Other Performance Optimizations

                                                                                                                                                                          bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

                                                                                                                                                                          bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

                                                                                                                                                                          bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

                                                                                                                                                                          90

                                                                                                                                                                          Optimized Sparse Kernel Interface - OSKI

                                                                                                                                                                          bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

                                                                                                                                                                          bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

                                                                                                                                                                          bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

                                                                                                                                                                          software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

                                                                                                                                                                          91

                                                                                                                                                                          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                          93

                                                                                                                                                                          Example Classical Conjugate Gradient (CG)

                                                                                                                                                                          SpMVs and dot products require communication in

                                                                                                                                                                          each iteration

                                                                                                                                                                          via CA Matrix Powers Kernel

                                                                                                                                                                          Global reduction to compute G

                                                                                                                                                                          94

                                                                                                                                                                          Example CA-Conjugate Gradient

                                                                                                                                                                          Local computations within inner loop require

                                                                                                                                                                          no communication

                                                                                                                                                                          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                          bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                          96

                                                                                                                                                                          Slower convergence due

                                                                                                                                                                          to roundoff

                                                                                                                                                                          Loss of accuracy due to roundoff

                                                                                                                                                                          At s = 16 monomial basis is rank deficient Method breaks down

                                                                                                                                                                          Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

                                                                                                                                                                          CA-CG (monomial)CG

                                                                                                                                                                          machine precision

                                                                                                                                                                          97

                                                                                                                                                                          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                          What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

                                                                                                                                                                          bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

                                                                                                                                                                          bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

                                                                                                                                                                          matrices

                                                                                                                                                                          Explicit (O(nnz)) Implicit (o(nnz))

                                                                                                                                                                          Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

                                                                                                                                                                          Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

                                                                                                                                                                          Indices

                                                                                                                                                                          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                          101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
[Plots: absolute error for random vectors, where results have the same magnitude but opposite signs, and relative error for orthogonal vectors, where even the sign is not reproducible]
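The underlying effect is just nonassociativity of floating-point addition, and it is easy to reproduce without MKL. The sketch below (our stand-in for the slide's experiment, in plain Python rather than MKL's threaded dot product; `chunked_sum` is our own helper) sums the same data under several reduction shapes and reports the spread:

    import random

    random.seed(0)
    x = [random.uniform(-1.0, 1.0) for _ in range(10**6)]

    def chunked_sum(v, chunk):
        # Sum in chunks and then combine, imitating the partial sums a
        # threaded reduction would form with ~len(v)/chunk partial results.
        return sum(sum(v[i:i + chunk]) for i in range(0, len(v), chunk))

    n = len(x)
    results = [chunked_sum(x, c) for c in (n, n // 2, n // 3, n // 4)]
    print("absolute error:", max(results) - min(results))  # typically nonzero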


Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
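A minimal sketch of the prerounding idea (our own simplified Python, not the tuned implementation; the constants and the helper name `prerounded_sum` are illustrative): round every summand to a common power-of-two grid chosen from the global maximum, so that all subsequent additions are exact and the result is independent of summation order.

    import math

    def prerounded_sum(x):
        n = len(x)
        m = max(abs(v) for v in x)
        if m == 0.0:
            return 0.0
        # Power-of-two shift M, large enough that n grid-aligned summands
        # stay exactly representable (ignores exponent-range extremes).
        e = math.frexp(m)[1]                      # m < 2**e
        M = math.ldexp(1.0, e + int(math.ceil(math.log2(n + 1))) + 1)
        total = 0.0
        for v in x:
            t = (v + M) - M   # rounds v to the common grid; exact in IEEE 754
            total += t        # sums of grid multiples incur no further rounding
        return total

In parallel, every processor must use the same shift M, which costs one max-reduction up front; the error is bounded by roughly n·ulp(M), and implementations along these lines (e.g., ReproBLAS) apply the trick to the leftovers in several further bins so the user can choose the accuracy (goal 4).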


Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                                                                          Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software

                                                                                                                                                                          (and compilers)

                                                                                                                                                                          • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                          • Why avoid communication
                                                                                                                                                                          • Goals
                                                                                                                                                                          • Outline
                                                                                                                                                                          • Outline (2)
• Lower bound for all "n3-like" linear algebra
• Lower bound for all "n3-like" linear algebra (2)
• Lower bound for all "n3-like" linear algebra (3)
• Limits to parallel scaling (1/2)
• Limits to parallel scaling (2/2)
                                                                                                                                                                          • Can we attain these lower bounds
                                                                                                                                                                          • Outline (3)
• 2.5D Matrix Multiplication
• 2.5D Matrix Multiplication (2)
• 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
• Perfect Strong Scaling – in Time and Energy (1/2)
• Perfect Strong Scaling – in Time and Energy (2/2)
                                                                                                                                                                          • Handling Heterogeneity
                                                                                                                                                                          • Application to Tensor Contractions
• C(i,j,k) = Σm A(i,j,m)·B(m,k)
                                                                                                                                                                          • Application to Tensor Contractions (2)
                                                                                                                                                                          • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                          • vs
                                                                                                                                                                          • Slide 26
                                                                                                                                                                          • Strassen-like beyond matmul
                                                                                                                                                                          • Cache and Network Oblivious Algorithms
                                                                                                                                                                          • CARMA Performance Distributed Memory
                                                                                                                                                                          • CARMA Performance Distributed Memory (2)
                                                                                                                                                                          • CARMA Performance Shared Memory
                                                                                                                                                                          • CARMA Performance Shared Memory (2)
                                                                                                                                                                          • Why is CARMA Faster in Shared Memory
                                                                                                                                                                          • Outline (4)
• One-sided Factorizations (LU, QR), so far
• TSQR: An Architecture-Dependent Algorithm
                                                                                                                                                                          • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                          • Minimizing Communication in TSLU
                                                                                                                                                                          • Making TSLU Numerically Stable
                                                                                                                                                                          • Stability of LU using TSLU CALU
• Why is stability of TSLU just a "Thm."?
                                                                                                                                                                          • Fixing TSLU
                                                                                                                                                                          • 2D CALU with Tournament Pivoting
• 2.5D CALU with Tournament Pivoting (c=4 copies)
• Exascale Machine Parameters (Source: DOE Exascale Workshop)
                                                                                                                                                                          • Exascale predicted speedups for Gaussian Elimination 2D CA
• 2.5D vs 2D LU: With and Without Pivoting
• Other CA algorithms for Ax=b, least squares (1/3)
• Other CA algorithms for Ax=b, least squares (2/3)
• Other CA algorithms for Ax=b, least squares (3/3)
                                                                                                                                                                          • Outline (5)
• What about sparse matrices? (1/3)
• Performance of 2.5D APSP using Kleene
• What about sparse matrices? (2/3)
• What about sparse matrices? (3/3)
                                                                                                                                                                          • Outline (6)
                                                                                                                                                                          • Symmetric Eigenproblem and SVD
                                                                                                                                                                          • Slide 58
                                                                                                                                                                          • Slide 59
                                                                                                                                                                          • Slide 60
                                                                                                                                                                          • Slide 61
                                                                                                                                                                          • Slide 62
                                                                                                                                                                          • Slide 63
                                                                                                                                                                          • Slide 64
                                                                                                                                                                          • Slide 65
                                                                                                                                                                          • Slide 66
                                                                                                                                                                          • Slide 67
                                                                                                                                                                          • Slide 68
                                                                                                                                                                          • Conventional vs CA - SBR
                                                                                                                                                                          • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                          • Nonsymmetric Eigenproblem
• Attaining the Lower bounds: Sequential
• Attaining the Lower bounds: Parallel 2D, M=(n2/P) (Ignoring po…
                                                                                                                                                                          • Outline (7)
                                                                                                                                                                          • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                          • Outline (8)
• Example: The Difficulty of Tuning SpMV
• Example: The Difficulty of Tuning
• Speedups on Itanium 2: The Need for Search
• Register Profile: Itanium 2
• Register Profiles: IBM and Intel IA-64
                                                                                                                                                                          • Another example of tuning challenges for SpMV
                                                                                                                                                                          • Zoom in to top corner
• 3x3 blocks look natural, but…
                                                                                                                                                                          • Extra Work Can Improve Efficiency
                                                                                                                                                                          • Slide 86
                                                                                                                                                                          • Slide 87
                                                                                                                                                                          • Slide 88
                                                                                                                                                                          • Slide 89
                                                                                                                                                                          • Summary of Other Performance Optimizations
                                                                                                                                                                          • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                          • Outline (9)
• Example: Classical Conjugate Gradient (CG)
• Example: CA-Conjugate Gradient
                                                                                                                                                                          • Outline (10)
                                                                                                                                                                          • Slide 96
                                                                                                                                                                          • Slide 97
                                                                                                                                                                          • Outline (11)
• What is a "sparse matrix"?
                                                                                                                                                                          • Outline (12)
                                                                                                                                                                          • Reproducible Floating Point Computation
                                                                                                                                                                          • Intel MKL non-reproducibility
                                                                                                                                                                          • GoalsApproaches for Reproducibility
• Performance results on 1024 proc Cray XC30, 1.2x to 3.2x slowdow…
                                                                                                                                                                          • Collaborators and Supporters
                                                                                                                                                                          • Summary

                                                                                                                                                                            • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                            • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                            • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                            • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                            • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                            • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                            • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                            • Outline (5)
                                                                                                                                                                            • What about sparse matrices (13)
                                                                                                                                                                            • Performance of 25D APSP using Kleene
                                                                                                                                                                            • What about sparse matrices (23)
                                                                                                                                                                            • What about sparse matrices (33)
                                                                                                                                                                            • Outline (6)
                                                                                                                                                                            • Symmetric Eigenproblem and SVD
                                                                                                                                                                            • Slide 58
                                                                                                                                                                            • Slide 59
                                                                                                                                                                            • Slide 60
                                                                                                                                                                            • Slide 61
                                                                                                                                                                            • Slide 62
                                                                                                                                                                            • Slide 63
                                                                                                                                                                            • Slide 64
                                                                                                                                                                            • Slide 65
                                                                                                                                                                            • Slide 66
                                                                                                                                                                            • Slide 67
                                                                                                                                                                            • Slide 68
                                                                                                                                                                            • Conventional vs CA - SBR
                                                                                                                                                                            • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                            • Nonsymmetric Eigenproblem
                                                                                                                                                                            • Attaining the Lower bounds Sequential
                                                                                                                                                                            • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                            • Outline (7)
                                                                                                                                                                            • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                            • Outline (8)
                                                                                                                                                                            • Example The Difficulty of Tuning SpMV
                                                                                                                                                                            • Example The Difficulty of Tuning
                                                                                                                                                                            • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                            • Register Profile Itanium 2
                                                                                                                                                                            • Register Profiles IBM and Intel IA-64
                                                                                                                                                                            • Another example of tuning challenges for SpMV
                                                                                                                                                                            • Zoom in to top corner
                                                                                                                                                                            • 3x3 blocks look natural buthellip
                                                                                                                                                                            • Extra Work Can Improve Efficiency
                                                                                                                                                                            • Slide 86
                                                                                                                                                                            • Slide 87
                                                                                                                                                                            • Slide 88
                                                                                                                                                                            • Slide 89
                                                                                                                                                                            • Summary of Other Performance Optimizations
                                                                                                                                                                            • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                            • Outline (9)
                                                                                                                                                                            • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                            • Example CA-Conjugate Gradient
                                                                                                                                                                            • Outline (10)
                                                                                                                                                                            • Slide 96
                                                                                                                                                                            • Slide 97
                                                                                                                                                                            • Outline (11)
                                                                                                                                                                            • What is a ldquosparse matrixrdquo
                                                                                                                                                                            • Outline (12)
                                                                                                                                                                            • Reproducible Floating Point Computation
                                                                                                                                                                            • Intel MKL non-reproducibility
                                                                                                                                                                            • GoalsApproaches for Reproducibility
                                                                                                                                                                            • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                            • Collaborators and Supporters
                                                                                                                                                                            • Summary

Effect of Combined RCM+TSP Reordering (Reverse Cuthill-McKee plus Traveling-Salesman-based reordering)

Before: green + red. After: green + blue.

89

2x speedups on Pentium 4, Power 4, …
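A hedged sketch of the RCM step using SciPy (the function name and demo data are ours; the finer-grained TSP-based reordering is not shown). RCM symmetrically permutes the matrix to cluster nonzeros near the diagonal, creating the denser structure that register blocking can then exploit:

```python
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def rcm_reorder(A):
    """Symmetric RCM permutation: clusters nonzeros near the diagonal."""
    perm = reverse_cuthill_mckee(sp.csr_matrix(A), symmetric_mode=True)
    return A[perm][:, perm], perm

# illustrative usage on a random symmetric pattern
A = sp.random(500, 500, density=0.01, format="csr", random_state=0)
A = A + A.T                       # symmetrize the sparsity pattern
B, perm = rcm_reorder(A)          # B has much smaller bandwidth than A
```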

                                                                                                                                                                              Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

                                                                                                                                                                              90
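The register-blocking idea above is easy to state in code. Below is a minimal Python sketch of SpMV for a BCSR (block compressed sparse row) matrix, ours and illustrative only; tuned implementations such as OSKI generate unrolled C for each block size. Blocks may contain explicit zeros: extra flops are traded for fewer index loads and better register reuse.

```python
import numpy as np

def bcsr_spmv(brow_ptr, bcol_ind, bval, bm, bn, x):
    """y = A @ x for A stored in BCSR with bm x bn register blocks."""
    n_brows = len(brow_ptr) - 1
    y = np.zeros(n_brows * bm)
    for bi in range(n_brows):
        acc = np.zeros(bm)
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            bj = bcol_ind[k]
            acc += bval[k] @ x[bj * bn:(bj + 1) * bn]   # dense block multiply
        y[bi * bm:(bi + 1) * bm] = acc
    return y

# 4x4 matrix as a 2x2 grid of 2x2 blocks, one stored block per block-row
blocks = [np.array([[1., 2.], [3., 4.]]), np.array([[5., 0.], [0., 6.]])]
y = bcsr_spmv([0, 1, 2], [0, 1], blocks, 2, 2, np.ones(4))   # -> [3, 7, 5, 6]
```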

                                                                                                                                                                              Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user’s matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For “advanced” users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

                                                                                                                                                                              91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                              93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

                                                                                                                                                                              94
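To make that communication pattern concrete, here is a minimal NumPy sketch of textbook CG (ours, not the deck’s code); the comments mark where a distributed implementation would communicate:

```python
import numpy as np

def cg(A, b, tol=1e-8, maxiter=1000):
    """Classical CG: one SpMV plus two dot products per iteration."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    p = r.copy()
    rr = r @ r                      # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                  # SpMV: exchange halo/ghost entries
        alpha = rr / (p @ Ap)       # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r              # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```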

Example: CA-Conjugate Gradient

The s SpMVs per outer iteration are replaced by one call to a CA matrix powers kernel, and the dot products by a single global reduction to compute the Gram matrix G. Local computations within the inner loop require no communication.
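Below is a sketch of s-step CG with the monomial basis, under our own simplifying assumptions (SPD A, no preconditioner); it shows the structure, not the tuned implementation benchmarked in the deck. One basis computation and one Gram-matrix reduction replace the 2s reductions of s classical iterations, and the inner loop touches only (2s+1)-long coefficient vectors:

```python
import numpy as np

def ca_cg(A, b, s=4, tol=1e-8, outer_iters=200):
    """s-step (communication-avoiding) CG sketch, monomial basis."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    for _ in range(outer_iters):
        # Matrix powers kernel: V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]
        P = [p]
        for _ in range(s):
            P.append(A @ P[-1])
        R = [r]
        for _ in range(s - 1):
            R.append(A @ R[-1])
        V = np.column_stack(P + R)           # n x (2s+1)
        G = V.T @ V                          # ONE global reduction

        # B encodes multiplication by A in coefficient space:
        # A @ (V @ c) == V @ (B @ c) while c stays inside the basis.
        m = 2 * s + 1
        B = np.zeros((m, m))
        for j in range(s):                   # shift within the p-basis
            B[j + 1, j] = 1.0
        for j in range(s - 1):               # shift within the r-basis
            B[s + 1 + j + 1, s + 1 + j] = 1.0

        # Coefficients of x - x_k, r, p in the basis V
        xc = np.zeros(m)
        rc = np.zeros(m); rc[s + 1] = 1.0    # r = V @ e_{s+1}
        pc = np.zeros(m); pc[0] = 1.0        # p = V @ e_0

        for _ in range(s):                   # communication-free inner loop
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new

        x = x + V @ xc                       # map coefficients back
        r = V @ rc
        p = V @ pc
        if np.linalg.norm(r) < tol:
            break
    return x
```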

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                              96

(Convergence plot: CA-CG with the monomial basis vs. classical CG, with machine precision marked for reference. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.)

                                                                                                                                                                              97
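The breakdown is easy to reproduce: the columns of the monomial basis all align with A’s dominant eigenvector, so the basis condition number grows geometrically with s. A small illustrative NumPy experiment (our own construction: a random SPD matrix with cond(A) ≈ 400, mirroring the model problem):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(1, 400, n)) @ Q.T   # SPD, cond(A) ~ 400
v = rng.standard_normal(n)

V = [v]
for s in range(16):
    V.append(A @ V[-1])                          # append A^(s+1) v
    # condition number of [v, Av, ..., A^(s+1) v] explodes with s
    print(s + 1, np.linalg.cond(np.column_stack(V)))
```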

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of “sparse” matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Taxonomy (nonzero entries vs. indices):
  Entries explicit (O(nnz)), indices explicit (O(nnz)): CSR and variations
  Entries explicit (O(nnz)), indices implicit (o(nnz)): vision, climate, AMR, …
  Entries implicit (o(nnz)), indices explicit (O(nnz)): graph Laplacian
  Entries implicit (o(nnz)), indices implicit (o(nnz)): stencils
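The sparse-plus-low-rank case rewards never forming A explicitly. A small sketch (function and variable names are ours) applying A = S + UDVᵀ to a vector in O(nnz(S) + n·k) work for rank k, instead of O(n²) for a dense A:

```python
import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_lowrank(S, U, D, V, x):
    """Compute (S + U @ D @ V.T) @ x without forming the dense sum."""
    return S @ x + U @ (D @ (V.T @ x))

# illustrative usage with random data
n, k = 1000, 5
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))
y = apply_sparse_plus_lowrank(S, U, D, V, rng.standard_normal(n))
```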

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                              101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, of GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don’t believe the results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: “What? How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Experiment: dot products on vectors of size 10⁶, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum minus minimum
• Relative error = absolute error / maximum absolute value

(Figures: absolute error for random vectors, where results have the same magnitude but opposite signs; relative error for orthogonal vectors, where even the sign is not reproducible.)

                                                                                                                                                                              103
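The root cause is that floating-point addition is not associative, so different thread counts imply different reduction orders and hence different sums. A small NumPy demonstration of the effect (ours, not MKL itself):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000).astype(np.float32)

s_fwd = np.float32(0.0)
for v in x:                 # sum in one order...
    s_fwd += v
s_bwd = np.float32(0.0)
for v in x[::-1]:           # ...and in the reverse order
    s_bwd += v

print(s_fwd, s_bwd, s_fwd == s_bwd)        # typically differ in the last bits
print(math.fsum(x.astype(float)))          # exactly rounded reference sum
```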

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

                                                                                                                                                                              104
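A one-bin sketch of the prerounding idea (our simplification of the Demmel/Nguyen approach; the real algorithm keeps several bins to preserve accuracy). Each summand is rounded against a common boundary M so the rounded parts are integer multiples of M’s grid spacing; their sum then commits no rounding error and is identical for any summation order or processor count:

```python
import math
import numpy as np

def reproducible_sum(x):
    """One-bin prerounding sum: order-independent, reduced accuracy.
    Assumes IEEE 754 doubles, nonempty x, and no overflow."""
    x = np.asarray(x, dtype=np.float64)
    amax = float(np.max(np.abs(x)))      # reduction #1 (max is reproducible)
    if amax == 0.0:
        return 0.0
    e = math.frexp(amax)[1]              # amax < 2**e
    # Boundary M large enough that all partial sums of the high parts
    # are exactly representable multiples of M's grid spacing.
    M = math.ldexp(1.0, e + max(1, math.ceil(math.log2(len(x)))))
    high = (x + M) - M                   # snap each x_i to M's grid
    return float(np.sum(high))           # reduction #2: exact, order-free
```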

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).

Don’t Communic…

106

                                                                                                                                                                              • Outline (12)
                                                                                                                                                                              • Reproducible Floating Point Computation
                                                                                                                                                                              • Intel MKL non-reproducibility
                                                                                                                                                                              • GoalsApproaches for Reproducibility
                                                                                                                                                                              • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                              • Collaborators and Supporters
                                                                                                                                                                              • Summary

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR (see the sketch below)
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
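To make register blocking concrete, here is a minimal sketch of SpMV in r×c block compressed sparse row (BCSR) format; the function name and data layout are ours for illustration, not the BeBOP implementation.

import numpy as np

def bcsr_spmv(rowptr, colind, blocks, x, r, c):
    """y = A @ x for A stored in r-by-c block CSR (BCSR).

    rowptr/colind index block rows/columns; blocks[k] is a dense
    r x c array. One column index is loaded per r*c stored values,
    and the small dense block multiply can be fully unrolled into
    registers -- the source of the register-blocking speedups above."""
    nbrows = len(rowptr) - 1
    y = np.zeros(nbrows * r)
    for bi in range(nbrows):
        for k in range(rowptr[bi], rowptr[bi + 1]):
            bj = colind[k]                       # block-column index
            y[bi*r:(bi+1)*r] += blocks[k] @ x[bj*c:(bj+1)*c]
    return y

In production code, scipy.sparse.bsr_matrix implements the same layout; the register-level payoff comes from unrolling the block multiply in a compiled kernel, which a pure-Python loop only illustrates.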

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
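For reference, a minimal NumPy sketch of classical CG with its communication events marked; on P processors the SpMV costs neighbor messages and each dot product is a global reduction, so every iteration synchronizes at least twice.

import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    """Classical conjugate gradients, annotated with its communication."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    nb = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                    # SpMV: neighbor communication
        alpha = rr / (p @ Ap)         # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                # dot product: global reduction
        if np.sqrt(rr_new) <= tol * nb:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x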

Example: CA-Conjugate Gradient

The s SpMVs per outer iteration are computed via the CA matrix powers kernel, and the dot products are replaced by one global reduction that computes the Gram matrix G. Local computations within the inner loop require no communication (see the sketch below).
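A sketch of the two communication-avoiding ingredients, using the (illustrative) monomial basis; the distributed fusing of the s SpMVs is described in comments rather than implemented.

import numpy as np

def matrix_powers(A, v, s):
    """Monomial matrix powers kernel: V = [v, A v, A^2 v, ..., A^s v].

    A communication-avoiding implementation fuses the s SpMVs: each
    processor fetches a depth-s ghost region of the vector once, then
    computes all s products with no further messages."""
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

# Schematic outer iteration of CA-CG (coefficient updates omitted):
#   V = matrix_powers(A, p, s)    # one communication phase (ghost zones)
#   G = V.T @ V                   # one global reduction replaces 2s dot products
#   for j in range(s):            # s inner steps update short coefficient
#       ...                       # vectors using only G: no communication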

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis). Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400. CA-CG converges more slowly and loses accuracy relative to machine precision due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
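The breakdown is easy to reproduce; the following illustrative script builds the slide's model problem with SciPy (variable names and the random starting vector are ours) and checks the numerical rank of the monomial basis at s = 16.

import numpy as np
import scipy.sparse as sp

n = 30                                    # 30x30 grid, cond(A) ~ 400
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()  # 2D Poisson, 5-point stencil

rng = np.random.default_rng(0)
s = 16
V = np.empty((n * n, s + 1))
V[:, 0] = rng.standard_normal(n * n)
V[:, 0] /= np.linalg.norm(V[:, 0])
for j in range(s):                        # monomial basis [x, Ax, ..., A^s x]
    w = A @ V[:, j]
    V[:, j + 1] = w / np.linalg.norm(w)   # column scaling alone cannot fix conditioning

print("numerical rank of V:", np.linalg.matrix_rank(V), "of", s + 1)
print("cond(V) = %.2e" % np.linalg.cond(V))  # near 1/eps: the basis is numerically unusable

Better-conditioned bases (e.g. Newton or Chebyshev polynomials of A) keep cond(V) bounded and are the standard fix.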

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                 Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:      CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit:      Graph Laplacian             Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (see the sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices
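A minimal sketch of why the S + UDVᵀ representation counts as "sparse": applying A to a vector costs O(nnz(S) + nk) work, even though A itself may be dense. Names and sizes below are illustrative.

import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_lowrank(S, U, D, V, x):
    """Apply A = S + U D V^T to x without ever forming A.

    S is sparse (O(nnz) work); U, V are tall-skinny and D is small and
    square, so the low-rank part costs O(n*k) -- the whole product
    stays o(n^2)."""
    return S @ x + U @ (D @ (V.T @ x))

# Example: n = 1000 with a rank-5 correction
n, k = 1000, 5
rng = np.random.default_rng(1)
S = sp.random(n, n, density=0.01, random_state=1, format="csr")
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))
x = rng.standard_normal(n)
y = apply_sparse_plus_lowrank(S, U, D, V, x)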

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers) who otherwise don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors and relative error for orthogonal vectors, for dot products computed by Intel MKL. Vector size 1e6, data aligned to 16-byte boundaries; for each input vector the dot product is computed using 1, 2, 3, or 4 threads. Absolute error = maximum − minimum; relative error = absolute error / maximum absolute value. For orthogonal vectors the computed results have the same magnitude but opposite signs: even the sign is not reproducible.]
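The effect needs no MKL to observe: reordering a summation changes the rounding pattern, which is exactly what a different thread count does inside a reduction. A small illustrative demo:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)

s1 = float(np.sum(x))            # one reduction order
s2 = float(np.sum(x[::-1]))      # reversed order
perm = rng.permutation(len(x))
s3 = float(np.sum(x[perm]))      # random order

print(s1 == s2, s1 == s3)        # typically False: FP addition is not associative
print(abs(s1 - s3))              # small but nonzero difference

For vectors whose exact dot product is zero, these order-dependent errors can even flip the sign of the computed result, as in the figure above.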

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.) — sketched below
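A one-level sketch of the prerounding idea (the full Nguyen–Demmel algorithm uses several extraction levels and tighter bounds; the constant below is illustrative): round every summand to a common absolute precision so that the subsequent additions are exact, hence order-independent.

import math
import random

def reproducible_sum(x):
    """One-level prerounding sum: same bits for any summation order."""
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    # Power of two M >= 2*len(x)*m: every prerounded value is then a
    # multiple of ulp(M)/2 and all partial sums stay below M, so the
    # additions commit no rounding error at all.
    M = 2.0 ** math.ceil(math.log2(2 * len(x) * m))
    q = [(v + M) - M for v in x]      # preround v onto the grid ulp(M)/2
    return sum(q)                     # exact, order-independent

x = [random.uniform(-1.0, 1.0) for _ in range(10**6)]
s1 = reproducible_sum(x)
random.shuffle(x)
s2 = reproducible_sum(x)
assert s1 == s2                       # bit-identical despite reordering

Accuracy is traded for reproducibility here (error is O(n·ulp(M))); the multi-level version recovers accuracy while keeping the result order-independent.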

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code, for n = 1M.

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).


                                                                                                                                                                                  Optimized Sparse Kernel Interface - OSKI

                                                                                                                                                                                  bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

                                                                                                                                                                                  bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

                                                                                                                                                                                  bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

                                                                                                                                                                                  software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

                                                                                                                                                                                  91

                                                                                                                                                                                  Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                                  ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                                  ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                                  bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                                  bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                                  93

                                                                                                                                                                                  Example Classical Conjugate Gradient (CG)

                                                                                                                                                                                  SpMVs and dot products require communication in

                                                                                                                                                                                  each iteration

                                                                                                                                                                                  via CA Matrix Powers Kernel

                                                                                                                                                                                  Global reduction to compute G

                                                                                                                                                                                  94

                                                                                                                                                                                  Example CA-Conjugate Gradient

                                                                                                                                                                                  Local computations within inner loop require

                                                                                                                                                                                  no communication

                                                                                                                                                                                  Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                                  ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                                  ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                                  bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                                  bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96–97

[Convergence plots: CA-CG (monomial basis) vs. CG on a model problem (2D Poisson 5-point stencil, 30×30 grid, cond(A) ≈ 400), with machine precision marked. CA-CG converges more slowly and loses accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
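The breakdown is easy to reproduce: the columns of the monomial basis [v, Av, …, A^s v] all tend toward the dominant eigenvector, so the basis conditioning grows roughly exponentially in s. A small demo (ours, assuming SciPy), using the same model problem:

    import numpy as np
    import scipy.sparse as sp

    # Model problem from the slide: 2D Poisson, 5-point stencil, 30x30 grid.
    n = 30
    T1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T1) + sp.kron(T1, sp.identity(n))).tocsr()

    rng = np.random.default_rng(0)
    v = rng.standard_normal(n * n)

    for s in (4, 8, 16):
        V = np.empty((n * n, s + 1))
        V[:, 0] = v
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]    # monomial basis [v, Av, ..., A^s v]
        print(s, np.linalg.cond(V))      # conditioning explodes with s

By s = 16 the condition number approaches 1/ε, i.e., the basis is numerically rank deficient.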

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices

                                    Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit (o(nnz)):  Graph Laplacian             Stencils
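As a sketch of the "sum of sparse matrices" case (ours, assuming SciPy; S, U, D, V are hypothetical names): apply A = S + UDV^T to a vector without ever forming A, so repeated application yields A^k x while preserving the structure.

    import numpy as np
    import scipy.sparse as sp

    def apply_A_power(S, U, D, V, x, k):
        # y = A^k x for A = S + U D V^T, never forming A explicitly;
        # each multiply costs O(nnz(S) + n*r) rather than O(n^2).
        for _ in range(k):
            x = S @ x + U @ (D @ (V.T @ x))
        return x

    n, r = 1000, 5
    rng = np.random.default_rng(0)
    S = sp.random(n, n, density=0.01, format="csr", random_state=0)
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    y = apply_A_power(S, U, D, V, rng.standard_normal(n), k=3)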

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                                  101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (differences of the same magnitude but opposite signs), and relative error for orthogonal vectors (even the sign is not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
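The underlying cause is simply that floating-point addition is not associative, and different thread counts induce different summation orders. A toy NumPy illustration (ours, not MKL itself), mimicking the experiment above:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**6)
    y = rng.standard_normal(10**6)
    p = x * y                                  # products to be summed

    results = {
        "left-to-right": float(sum(p)),        # strict sequential order
        "pairwise": float(np.sum(p)),          # NumPy's pairwise summation
        "reversed": float(np.sum(p[::-1])),    # pairwise over reversed order
        "4 chunks": float(sum(np.sum(c) for c in np.array_split(p, 4))),
    }
    print(results)                             # the last bits differ
    print("absolute error:", max(results.values()) - min(results.values()))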

                                                                                                                                                                                  103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
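A single-bin toy sketch of the prerounding idea (our reconstruction; the actual Nguyen–Demmel algorithm uses several bins to retain accuracy): pre-round every summand to a common coarse grid, so that all subsequent additions are exact and therefore order-independent.

    import numpy as np

    def prerounded_sum(x):
        n = x.size
        m = float(np.max(np.abs(x)))       # one extra reduction: max magnitude
        if m == 0.0:
            return 0.0
        k = int(np.ceil(np.log2(n * m)))   # choose 2^k >= n*m
        C = 3.0 * 2.0**k                   # x + C lies in [2^(k+1), 2^(k+2))
        t = (x + C) - C                    # rounds each x_i to a multiple of 2^(k-51)
        # Every t_i is a multiple of 2^(k-51) and all partial sums fit in
        # 53 bits, so the additions below are exact: ANY summation order
        # returns identical bits. The cost is the error made in prerounding.
        return float(np.sum(t))

    rng = np.random.default_rng(2)
    x = rng.standard_normal(10**6)
    print(prerounded_sum(x) == prerounded_sum(x[::-1]))   # True: reproducible
    print(np.sum(x) == np.sum(x[::-1]))                   # usually False

Reordering the input leaves prerounded_sum bit-for-bit identical, at the cost of the accuracy lost in prerounding; the multi-bin version recovers that accuracy.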

                                                                                                                                                                                  104

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                                                    • 25D Matrix Multiplication (2)
                                                                                                                                                                                    • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                                    • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                                    • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                                    • Handling Heterogeneity
                                                                                                                                                                                    • Application to Tensor Contractions
                                                                                                                                                                                    • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                                    • Application to Tensor Contractions (2)
                                                                                                                                                                                    • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                                    • vs
                                                                                                                                                                                    • Slide 26
                                                                                                                                                                                    • Strassen-like beyond matmul
                                                                                                                                                                                    • Cache and Network Oblivious Algorithms
                                                                                                                                                                                    • CARMA Performance Distributed Memory
                                                                                                                                                                                    • CARMA Performance Distributed Memory (2)
                                                                                                                                                                                    • CARMA Performance Shared Memory
                                                                                                                                                                                    • CARMA Performance Shared Memory (2)
                                                                                                                                                                                    • Why is CARMA Faster in Shared Memory
                                                                                                                                                                                    • Outline (4)
                                                                                                                                                                                    • One-sided Factorizations (LU QR) so far
                                                                                                                                                                                    • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                                    • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                                    • Minimizing Communication in TSLU
                                                                                                                                                                                    • Making TSLU Numerically Stable
                                                                                                                                                                                    • Stability of LU using TSLU CALU
                                                                                                                                                                                    • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                                    • Fixing TSLU
                                                                                                                                                                                    • 2D CALU with Tournament Pivoting
                                                                                                                                                                                    • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                                    • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                                    • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                                    • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                                    • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                                    • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                                    • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                                    • Outline (5)
                                                                                                                                                                                    • What about sparse matrices (13)
                                                                                                                                                                                    • Performance of 25D APSP using Kleene
                                                                                                                                                                                    • What about sparse matrices (23)
                                                                                                                                                                                    • What about sparse matrices (33)
                                                                                                                                                                                    • Outline (6)
                                                                                                                                                                                    • Symmetric Eigenproblem and SVD
                                                                                                                                                                                    • Slide 58
                                                                                                                                                                                    • Slide 59
                                                                                                                                                                                    • Slide 60
                                                                                                                                                                                    • Slide 61
                                                                                                                                                                                    • Slide 62
                                                                                                                                                                                    • Slide 63
                                                                                                                                                                                    • Slide 64
                                                                                                                                                                                    • Slide 65
                                                                                                                                                                                    • Slide 66
                                                                                                                                                                                    • Slide 67
                                                                                                                                                                                    • Slide 68
                                                                                                                                                                                    • Conventional vs CA - SBR
                                                                                                                                                                                    • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                    • Nonsymmetric Eigenproblem
                                                                                                                                                                                    • Attaining the Lower bounds Sequential
                                                                                                                                                                                    • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                    • Outline (7)
                                                                                                                                                                                    • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                    • Outline (8)
                                                                                                                                                                                    • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                    • Example The Difficulty of Tuning
                                                                                                                                                                                    • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                    • Register Profile Itanium 2
                                                                                                                                                                                    • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                    • Another example of tuning challenges for SpMV
                                                                                                                                                                                    • Zoom in to top corner
                                                                                                                                                                                    • 3x3 blocks look natural buthellip
                                                                                                                                                                                    • Extra Work Can Improve Efficiency
                                                                                                                                                                                    • Slide 86
                                                                                                                                                                                    • Slide 87
                                                                                                                                                                                    • Slide 88
                                                                                                                                                                                    • Slide 89
                                                                                                                                                                                    • Summary of Other Performance Optimizations
                                                                                                                                                                                    • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                    • Outline (9)
                                                                                                                                                                                    • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                    • Example CA-Conjugate Gradient
                                                                                                                                                                                    • Outline (10)
                                                                                                                                                                                    • Slide 96
                                                                                                                                                                                    • Slide 97
                                                                                                                                                                                    • Outline (11)
                                                                                                                                                                                    • What is a ldquosparse matrixrdquo
                                                                                                                                                                                    • Outline (12)
                                                                                                                                                                                    • Reproducible Floating Point Computation
                                                                                                                                                                                    • Intel MKL non-reproducibility
                                                                                                                                                                                    • GoalsApproaches for Reproducibility
                                                                                                                                                                                    • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                    • Collaborators and Supporters
                                                                                                                                                                                    • Summary

                                                                                                                                                                                      93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration
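To make the communication pattern concrete, here is a minimal NumPy sketch of classical CG (our illustration, not code from the talk); the comments mark which steps communicate in a distributed-memory setting.

import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    # Classical CG: each iteration does 1 SpMV and 2 dot products,
    # i.e. neighbor communication plus 2 global reductions.
    x = x0.copy()
    r = b - A @ x               # residual
    p = r.copy()                # search direction
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p              # SpMV: neighbor communication
        alpha = rr / (p @ Ap)   # dot product: global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r          # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x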

94

Example: CA-Conjugate Gradient

via CA Matrix Powers Kernel
Global reduction to compute G
Local computations within inner loop require no communication
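Below is a hedged sketch of one common s-step (CA) reorganization of CG with the monomial basis, assuming A is symmetric positive definite; the function and variable names (ca_cg, pc, rc, xc) are ours, and the plain basis-building loop stands in for a true matrix powers kernel.

import numpy as np

def ca_cg(A, b, x0, s=4, outer=100, tol=1e-8):
    n = len(b)
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for _ in range(outer):
        if np.linalg.norm(r) < tol:
            break
        # Basis V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r];
        # a matrix powers kernel computes this with one round of
        # neighbor communication instead of s separate SpMVs.
        V = np.empty((n, 2 * s + 1))
        V[:, 0] = p
        for j in range(1, s + 1):
            V[:, j] = A @ V[:, j - 1]
        V[:, s + 1] = r
        for j in range(s + 2, 2 * s + 1):
            V[:, j] = A @ V[:, j - 1]
        G = V.T @ V  # Gram matrix: one global reduction per s steps
        # Shift matrix B: A @ (V @ c) == V @ (B @ c) for the short
        # coefficient vectors c that arise below (monomial basis).
        B = np.zeros((2 * s + 1, 2 * s + 1))
        for j in range(s):
            B[j + 1, j] = 1.0
        for j in range(s + 1, 2 * s):
            B[j + 1, j] = 1.0
        # s inner iterations on (2s+1)-vectors: no communication.
        pc = np.zeros(2 * s + 1); pc[0] = 1.0
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0
        xc = np.zeros(2 * s + 1)
        for _ in range(s):
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x += V @ xc
        r = V @ rc
        p = V @ pc
    return x

In exact arithmetic this matches classical CG while performing one global reduction (for G) every s iterations instead of two per iteration; the model-problem plot below shows what the monomial basis costs in stability.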

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                                      96

[Plot: convergence of CA-CG (monomial basis) vs. CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400), with machine precision marked. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
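A quick way to see this breakdown numerically (our illustration, with a 1D Laplacian standing in for the 2D model problem):

import numpy as np

n = 30
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D Poisson
r = np.random.default_rng(0).standard_normal(n)
for s in (4, 8, 16):
    V = np.empty((n, s + 1))
    V[:, 0] = r
    for j in range(1, s + 1):
        V[:, j] = A @ V[:, j - 1]   # monomial basis [r, Ar, ..., A^s r]
    print(s, np.linalg.cond(V))     # conditioning explodes with s

The columns increasingly align with the dominant eigenvector, so cond(V) grows roughly exponentially in s; around s = 16 it reaches the reciprocal of machine precision here and the basis is numerically rank deficient.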

                                                                                                                                                                                      97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices

Nonzero entries \ Indices | Explicit (O(nnz))  | Implicit (o(nnz))
Explicit (O(nnz))         | CSR and variations | Vision, climate, AMR, …
Implicit (o(nnz))         | Graph Laplacian    | Stencils
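To make the S + UDV^T example concrete, a small sketch (apply_A and apply_Ak are hypothetical helper names, not an API from the talk) of applying such an A, and its powers, to a vector without ever forming A:

import numpy as np
import scipy.sparse as sp

def apply_A(S, U, D, V, x):
    # A = S + U D V^T, with S sparse (n x n, e.g. CSR), U and V
    # tall-skinny (n x k), D small (k x k): one SpMV plus O(nk) work.
    return S @ x + U @ (D @ (V.T @ x))

def apply_Ak(S, U, D, V, x, k):
    # A^k @ x by repeated application; a communication-avoiding version
    # would instead expand (S + U D V^T)^k into S^k plus low-rank terms,
    # per the bullet above, so the corrections travel as small matrices.
    for _ in range(k):
        x = apply_A(S, U, D, V, x)
    return x

# e.g. S = sp.random(1000, 1000, density=1e-3, format="csr"),
#      U = V = np.ones((1000, 2)), D = np.eye(2), x = np.ones(1000)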

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                                      101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible), over dot products run with different thread counts.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
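The root cause is that floating-point addition is not associative, so different reduction trees (different thread counts) regroup the sum and can legitimately return different bits:

print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6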

                                                                                                                                                                                      103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
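To give the flavor of the prerounding idea, here is a simplified two-reduction sketch (ours, not the tuned one-reduction ReproBLAS code; overflow, underflow, and Inf/NaN handling are omitted):

import math
import numpy as np

def reproducible_sum(x):
    # Reduction 1: max magnitude (a max is order-independent).
    m = float(np.max(np.abs(x)))
    if m == 0.0:
        return 0.0
    n = len(x)
    # Pre-round every summand to a multiple of one power of two, chosen
    # so every partial sum of n such terms stays below 2^53 * delta and
    # is therefore exact in binary64.
    delta = 2.0 ** (math.ceil(math.log2(m)) + math.ceil(math.log2(n)) - 52)
    xr = np.round(x / delta) * delta   # scaling by 2^e and rounding: exact
    # Reduction 2: summing exact multiples of delta is associative,
    # so any ordering / thread count gives bit-identical results.
    return float(np.sum(xr))

Roughly speaking, the full technique keeps several such "bins" instead of one, letting the user choose accuracy (goal 4), and folds the two reductions into one.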

                                                                                                                                                                                      104

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                                                      • Implementing Communication-Avoiding Algorithms
• Why avoid communication?
                                                                                                                                                                                      • Goals
                                                                                                                                                                                      • Outline
                                                                                                                                                                                      • Outline (2)
• Lower bound for all "n^3-like" linear algebra
• Lower bound for all "n^3-like" linear algebra (2)
• Lower bound for all "n^3-like" linear algebra (3)
• Limits to parallel scaling (1/2)
• Limits to parallel scaling (2/2)
• Can we attain these lower bounds?
                                                                                                                                                                                      • Outline (3)
• 2.5D Matrix Multiplication
• 2.5D Matrix Multiplication (2)
• 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
• Perfect Strong Scaling – in Time and Energy (1/2)
• Perfect Strong Scaling – in Time and Energy (2/2)
                                                                                                                                                                                      • Handling Heterogeneity
                                                                                                                                                                                      • Application to Tensor Contractions
• C(i,j,k) = Σm A(i,j,m)·B(m,k)
                                                                                                                                                                                      • Application to Tensor Contractions (2)
                                                                                                                                                                                      • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                                      • vs
                                                                                                                                                                                      • Slide 26
                                                                                                                                                                                      • Strassen-like beyond matmul
                                                                                                                                                                                      • Cache and Network Oblivious Algorithms
                                                                                                                                                                                      • CARMA Performance Distributed Memory
                                                                                                                                                                                      • CARMA Performance Distributed Memory (2)
                                                                                                                                                                                      • CARMA Performance Shared Memory
                                                                                                                                                                                      • CARMA Performance Shared Memory (2)
                                                                                                                                                                                      • Why is CARMA Faster in Shared Memory
                                                                                                                                                                                      • Outline (4)
• One-sided Factorizations (LU, QR), so far
                                                                                                                                                                                      • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                                      • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                                      • Minimizing Communication in TSLU
                                                                                                                                                                                      • Making TSLU Numerically Stable
                                                                                                                                                                                      • Stability of LU using TSLU CALU
• Why is stability of TSLU just a "Thm"?
                                                                                                                                                                                      • Fixing TSLU
                                                                                                                                                                                      • 2D CALU with Tournament Pivoting
• 2.5D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                                      • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                                      • Exascale predicted speedups for Gaussian Elimination 2D CA
• 2.5D vs 2D LU With and Without Pivoting
• Other CA algorithms for Ax=b, least squares (1/3)
• Other CA algorithms for Ax=b, least squares (2/3)
• Other CA algorithms for Ax=b, least squares (3/3)
                                                                                                                                                                                      • Outline (5)
• What about sparse matrices? (1/3)
• Performance of 2.5D APSP using Kleene
• What about sparse matrices? (2/3)
• What about sparse matrices? (3/3)
                                                                                                                                                                                      • Outline (6)
                                                                                                                                                                                      • Symmetric Eigenproblem and SVD
                                                                                                                                                                                      • Slide 58
                                                                                                                                                                                      • Slide 59
                                                                                                                                                                                      • Slide 60
                                                                                                                                                                                      • Slide 61
                                                                                                                                                                                      • Slide 62
                                                                                                                                                                                      • Slide 63
                                                                                                                                                                                      • Slide 64
                                                                                                                                                                                      • Slide 65
                                                                                                                                                                                      • Slide 66
                                                                                                                                                                                      • Slide 67
                                                                                                                                                                                      • Slide 68
                                                                                                                                                                                      • Conventional vs CA - SBR
                                                                                                                                                                                      • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                      • Nonsymmetric Eigenproblem
                                                                                                                                                                                      • Attaining the Lower bounds Sequential
                                                                                                                                                                                      • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                      • Outline (7)
                                                                                                                                                                                      • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                      • Outline (8)
                                                                                                                                                                                      • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                      • Example The Difficulty of Tuning
                                                                                                                                                                                      • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                      • Register Profile Itanium 2
                                                                                                                                                                                      • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                      • Another example of tuning challenges for SpMV
                                                                                                                                                                                      • Zoom in to top corner
                                                                                                                                                                                      • 3x3 blocks look natural buthellip
                                                                                                                                                                                      • Extra Work Can Improve Efficiency
                                                                                                                                                                                      • Slide 86
                                                                                                                                                                                      • Slide 87
                                                                                                                                                                                      • Slide 88
                                                                                                                                                                                      • Slide 89
                                                                                                                                                                                      • Summary of Other Performance Optimizations
                                                                                                                                                                                      • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                      • Outline (9)
                                                                                                                                                                                      • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                      • Example CA-Conjugate Gradient
                                                                                                                                                                                      • Outline (10)
                                                                                                                                                                                      • Slide 96
                                                                                                                                                                                      • Slide 97
                                                                                                                                                                                      • Outline (11)
                                                                                                                                                                                      • What is a ldquosparse matrixrdquo
                                                                                                                                                                                      • Outline (12)
                                                                                                                                                                                      • Reproducible Floating Point Computation
                                                                                                                                                                                      • Intel MKL non-reproducibility
                                                                                                                                                                                      • GoalsApproaches for Reproducibility
                                                                                                                                                                                      • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                      • Collaborators and Supporters
                                                                                                                                                                                      • Summary

Example: CA-Conjugate Gradient

[Figure: the CA-CG algorithm listing, with three callouts: the s-step Krylov basis is computed via the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication.]

94
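To make that structure concrete, here is a minimal single-node sketch of s-step CA-CG with the monomial basis. This is a toy model of the reorganization on these slides, not the production algorithm: in a distributed run the basis V would come from the matrix powers kernel (one round of neighbor communication) and G from a single global reduction. All names are illustrative, and there is no convergence test, for brevity.

import numpy as np

def ca_cg(A, b, s=4, outer_iters=25):
    # Sketch of s-step (communication-avoiding) CG, monomial basis.
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    for _ in range(outer_iters):
        # "Matrix powers kernel": V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r].
        # In parallel this needs only one round of neighbor communication.
        P = [p]
        for _ in range(s):
            P.append(A @ P[-1])
        R = [r]
        for _ in range(s - 1):
            R.append(A @ R[-1])
        V = np.column_stack(P + R)              # n x (2s+1)

        # One global reduction per s steps: the Gram matrix of the basis.
        G = V.T @ V                             # (2s+1) x (2s+1)

        # Change-of-basis matrix: A @ V[:, i] == V @ B[:, i] for every
        # column the inner loop actually touches.
        m = 2 * s + 1
        B = np.zeros((m, m))
        for i in range(s):
            B[i + 1, i] = 1.0                   # A * (A^i p) = A^(i+1) p
        for i in range(s + 1, 2 * s):
            B[i + 1, i] = 1.0                   # same shift for the r-basis

        # Coefficients of p, r, and the update to x in the basis V.
        pc = np.zeros(m); pc[0] = 1.0
        rc = np.zeros(m); rc[s + 1] = 1.0
        xc = np.zeros(m)

        # s inner iterations: only O(s^2)-sized local work, no communication.
        for _ in range(s):
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new

        # Map coefficients back to long vectors for the next outer iteration.
        x = x + V @ xc
        r = V @ rc
        p = V @ pc
    return x

In exact arithmetic each outer iteration reproduces s steps of classical CG while replacing the roughly 2s global reductions (the dot products) with one, so the latency cost of the reductions drops by a factor of about s.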

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                                        96

[Figure: convergence of CA-CG (monomial basis) vs. CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). The plot shows slower convergence and loss of accuracy due to roundoff relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

                                                                                                                                                                                        97
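The breakdown is easy to reproduce: the condition number of the monomial Krylov basis grows rapidly with s, so by s = 16 the basis is numerically rank deficient in double precision. A small check on the same model problem (a sketch; the helper name and starting vector are illustrative):

import numpy as np

def poisson_2d(k):
    # 2D Poisson, 5-point stencil, k x k grid (the model problem above)
    T = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    return np.kron(np.eye(k), T) + np.kron(T, np.eye(k))

A = poisson_2d(30)
print("cond(A) =", np.linalg.cond(A))        # ~400, as on the slide

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])
basis = [v / np.linalg.norm(v)]
for s in (4, 8, 12, 16):
    while len(basis) <= s:                   # grow to [v, Av, ..., A^s v]
        w = A @ basis[-1]
        basis.append(w / np.linalg.norm(w))  # normalized monomial vectors
    K = np.column_stack(basis)
    print(f"s = {s:2d}: cond of monomial basis = {np.linalg.cond(K):.2e}")
# Once cond(K) approaches 1/eps ~ 1e16, quantities computed from the Gram
# matrix K^T K carry no correct digits, and the method breaks down.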

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                  Indices
                                  Explicit (O(nnz))    Implicit (o(nnz))
  Nonzero    Explicit (O(nnz))    CSR and variations   Vision, climate, AMR, …
  entries    Implicit (o(nnz))    Graph Laplacian      Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices
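As a hedged illustration of the last two bullets: a sparse-plus-low-rank matrix A = S + U D V^T is kept in factored form, and A (or A^k) is only ever applied to vectors, so the dense n x n product is never formed. The helper names and toy sizes below are hypothetical.

import numpy as np
import scipy.sparse as sp

def apply_A(S, U, D, V, x):
    # y = A x = S x + U (D (V^T x)); costs O(nnz(S) + n*rank),
    # never densifies A
    return S @ x + U @ (D @ (V.T @ x))

def apply_A_power(S, U, D, V, x, k):
    # A^k x by repeated application: each step stays at sparse-plus-
    # low-rank cost, which is why one wants A^k expressed as
    # "S^k plus low rank" rather than as a dense matrix
    for _ in range(k):
        x = apply_A(S, U, D, V, x)
    return x

# toy instance
n, rank = 2000, 5
rng = np.random.default_rng(1)
S = sp.random(n, n, density=0.005, format="csr", random_state=1)
U = rng.standard_normal((n, rank))
D = np.diag(rng.standard_normal(rank))
V = rng.standard_normal((n, rank))
x = rng.standard_normal(n)
y = apply_A_power(S, U, D, V, x, k=3)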

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

                                                                                                                                                                                        101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two plots of dot-product variation: "Absolute Error for Random Vectors" (results of the same magnitude but opposite signs) and "Relative Error for Orthogonal Vectors" (even the sign is not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value

103
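The variation comes from floating-point addition not being associative: different thread counts imply different reduction orders. The toy below mimics that mechanism in pure Python; it is not MKL, just the same effect with contiguous-block partial sums standing in for threads.

import random

random.seed(0)
# Random summands, as in the "Random Vectors" plot above
x = [random.uniform(-1, 1) for _ in range(10**6)]

def blocked_sum(x, nthreads):
    # Mimic a parallel reduction: each "thread" sums a contiguous block,
    # then the partial sums are combined. Different nthreads means a
    # different rounding order, hence (possibly) different result bits.
    n = len(x)
    block = -(-n // nthreads)   # ceiling division
    partials = [sum(x[i:i + block]) for i in range(0, n, block)]
    return sum(partials)

results = {t: blocked_sum(x, t) for t in (1, 2, 3, 4)}
print(results)
print("absolute error:", max(results.values()) - min(results.values()))
# The error is generically nonzero: summation order changes the answer.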

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

104
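For the third approach, here is a deliberately simplified, one-level sketch of the prerounding idea. The real Demmel/Nguyen algorithm uses several "bins" and keeps the discarded low-order parts to let the user choose accuracy; this toy keeps a single bin and trades accuracy for order independence.

import math

def prerounded_sum(x):
    # Toy one-level prerounding: reproducible but limited accuracy.
    n = len(x)
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    # Boundary M, large enough that n truncated summands accumulate with
    # no rounding at all, so the result is independent of summation order.
    M = 2.0 ** (math.ceil(math.log2(m)) + math.ceil(math.log2(n)) + 1)
    # fl((v + M) - M) keeps only the bits of v above ~ulp(M); every t is
    # a multiple of ulp(M)/2, so sums of the t's are exact in any order.
    t = [(v + M) - M for v in x]
    return sum(t)   # same bits regardless of reduction tree / # threads

# usage: bitwise identical under reordering
import random
random.seed(0)
x = [random.uniform(-1, 1) for _ in range(10**5)]
a = prerounded_sum(x)
random.shuffle(x)
assert prerounded_sum(x) == a

The dropped low-order parts bound the error at roughly n * ulp(M), which is why the full algorithm carries extra bins instead of discarding them.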

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106


                                                                                                                                                                                          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                                          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                                          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                                          bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                                          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                                          96

                                                                                                                                                                                          Slower convergence due

                                                                                                                                                                                          to roundoff

                                                                                                                                                                                          Loss of accuracy due to roundoff

                                                                                                                                                                                          At s = 16 monomial basis is rank deficient Method breaks down

                                                                                                                                                                                          Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

                                                                                                                                                                                          CA-CG (monomial)CG

                                                                                                                                                                                          machine precision

                                                                                                                                                                                          97

                                                                                                                                                                                          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                                          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                                          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                                          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                                          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                                          What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

                                                                                                                                                                                          bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

                                                                                                                                                                                          bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

                                                                                                                                                                                          matrices

                                                                                                                                                                                          Explicit (O(nnz)) Implicit (o(nnz))

                                                                                                                                                                                          Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

                                                                                                                                                                                          Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

                                                                                                                                                                                          Indices

                                                                                                                                                                                          Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                                          ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                                          ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                                          bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                                          bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors, and relative error for orthogonal vectors, of dot products recomputed with different thread counts. For orthogonal vectors the runs can return results of the same magnitude and opposite signs: even the sign is not reproducible.]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
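The slide's measurement is easy to mimic without MKL: the rounding of a dot product depends on how the sum is split across threads, because floating-point addition is not associative. A small self-contained sketch of the same effect (our own illustration, not the experiment's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def dot_blocked(x, y, nthreads):
    """Mimic a parallel dot product: each 'thread' sums one contiguous
    chunk, then the partial results are combined. Different thread
    counts change the rounding, hence (possibly) the final bits."""
    parts = [np.dot(xc, yc) for xc, yc in
             zip(np.array_split(x, nthreads), np.array_split(y, nthreads))]
    return sum(parts)

results = [dot_blocked(x, y, t) for t in (1, 2, 3, 4)]
print(results)
print("absolute error:", max(results) - min(results))
```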

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Pre-rounding technique (Nguyen, D.)
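Pre-rounding needs the least machinery to illustrate. Below is a drastically simplified, single-extraction sketch of the idea (our own; real implementations such as ReproBLAS use several "bins" and achieve far better accuracy): every summand is rounded to a multiple of a common power-of-two boundary, after which all additions are exact, so any summation order, reduction tree, or thread count yields identical bits.

```python
import math
import numpy as np

def reproducible_sum(x):
    """Order-independent summation via one-level pre-rounding
    (simplified sketch of the Nguyen/Demmel technique)."""
    n = len(x)
    M = max(abs(float(v)) for v in x)
    if M == 0.0:
        return 0.0
    # Power-of-two boundary sigma >= 2*n*M, so every partial sum of the
    # pre-rounded values is a small multiple of ulp(sigma): all exactly
    # representable, hence no rounding and no order dependence.
    sigma = math.ldexp(1.0, math.ceil(math.log2(2 * n * M)))
    total = 0.0
    for v in x:
        q = (sigma + float(v)) - sigma  # v rounded to a multiple of ulp(sigma)
        total += q                      # exact addition
    return total

rng = np.random.default_rng(1)
x = rng.standard_normal(10**5)
print(reproducible_sum(x) == reproducible_sum(x[::-1]))  # True: bit-identical
print(sum(x), reproducible_sum(x))                       # nearby values
```

The single extraction trades accuracy for reproducibility (absolute error up to about n·ulp(sigma)/2); repeating the extraction on the residuals x_i − q_i recovers accuracy, which is the role of the extra bins in the full method.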

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)



                                                                                                                                                                                            • Slide 60
                                                                                                                                                                                            • Slide 61
                                                                                                                                                                                            • Slide 62
                                                                                                                                                                                            • Slide 63
                                                                                                                                                                                            • Slide 64
                                                                                                                                                                                            • Slide 65
                                                                                                                                                                                            • Slide 66
                                                                                                                                                                                            • Slide 67
                                                                                                                                                                                            • Slide 68
                                                                                                                                                                                            • Conventional vs CA - SBR
                                                                                                                                                                                            • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                            • Nonsymmetric Eigenproblem
                                                                                                                                                                                            • Attaining the Lower bounds Sequential
                                                                                                                                                                                            • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                            • Outline (7)
                                                                                                                                                                                            • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                            • Outline (8)
                                                                                                                                                                                            • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                            • Example The Difficulty of Tuning
                                                                                                                                                                                            • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                            • Register Profile Itanium 2
                                                                                                                                                                                            • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                            • Another example of tuning challenges for SpMV
                                                                                                                                                                                            • Zoom in to top corner
                                                                                                                                                                                            • 3x3 blocks look natural buthellip
                                                                                                                                                                                            • Extra Work Can Improve Efficiency
                                                                                                                                                                                            • Slide 86
                                                                                                                                                                                            • Slide 87
                                                                                                                                                                                            • Slide 88
                                                                                                                                                                                            • Slide 89
                                                                                                                                                                                            • Summary of Other Performance Optimizations
                                                                                                                                                                                            • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                            • Outline (9)
                                                                                                                                                                                            • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                            • Example CA-Conjugate Gradient
                                                                                                                                                                                            • Outline (10)
                                                                                                                                                                                            • Slide 96
                                                                                                                                                                                            • Slide 97
                                                                                                                                                                                            • Outline (11)
                                                                                                                                                                                            • What is a ldquosparse matrixrdquo
                                                                                                                                                                                            • Outline (12)
                                                                                                                                                                                            • Reproducible Floating Point Computation
                                                                                                                                                                                            • Intel MKL non-reproducibility
                                                                                                                                                                                            • GoalsApproaches for Reproducibility
                                                                                                                                                                                            • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                            • Collaborators and Supporters
                                                                                                                                                                                            • Summary

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as sum of S^k and low-rank matrices

The four combinations, with examples (a minimal code sketch of the sparse-plus-low-rank case follows the table):

                                      Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz))   CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit (o(nnz))   Graph Laplacian             Stencils
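As a concrete illustration of the sparse-plus-low-rank bullet above, here is a minimal C sketch (the struct, names, and CSR layout are illustrative assumptions, not code from the talk). It applies A = S + U·D·V^T to a vector in O(nnz + n·r + r^2) work, never forming A explicitly:

#include <stdlib.h>

/* Illustrative CSR container (an assumption, not the talk's data structure) */
typedef struct {
    int n;          /* dimension */
    int *rowptr;    /* length n+1 */
    int *colind;    /* length nnz */
    double *val;    /* length nnz */
} csr_t;

/* y = S*x: touches only O(nnz) data, which is the point of a sparse format */
static void spmv_csr(const csr_t *S, const double *x, double *y) {
    for (int i = 0; i < S->n; i++) {
        double sum = 0.0;
        for (int k = S->rowptr[i]; k < S->rowptr[i + 1]; k++)
            sum += S->val[k] * x[S->colind[k]];
        y[i] = sum;
    }
}

/* y = (S + U*D*V^T)*x, computed as S*x + U*(D*(V^T*x)).
   U, V are n-by-r row-major, D is r-by-r row-major, with small r. */
void apply_sparse_plus_lowrank(const csr_t *S, const double *U,
                               const double *D, const double *V,
                               int r, const double *x, double *y) {
    int n = S->n;
    double *t = calloc((size_t)r, sizeof *t);  /* t = V^T * x */
    double *s = calloc((size_t)r, sizeof *s);  /* s = D * t   */
    spmv_csr(S, x, y);                          /* y = S * x  */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < r; j++)
            t[j] += V[i * r + j] * x[i];
    for (int j = 0; j < r; j++)
        for (int k = 0; k < r; k++)
            s[j] += D[j * r + k] * t[k];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < r; j++)
            y[i] += U[i * r + j] * s[j];        /* y += U * s */
    free(t);
    free(s);
}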

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two plots of dot-product variation across thread counts: "Absolute Error for Random Vectors" (same magnitude, opposite signs) and "Relative Error for Orthogonal Vectors" (sign not reproducible)]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
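The variation in the plots comes from nonassociativity: a different thread count regroups the same summands, and floating-point addition rounds differently under regrouping. A self-contained C illustration (values chosen for clarity; this is not the MKL experiment itself):

#include <stdio.h>

int main(void) {
    /* The same three summands, grouped two ways: a parallel dot product
       with a different thread count changes the grouping just like this. */
    double a = 1.0, b = 1e100, c = -1e100;
    printf("(a+b)+c = %g\n", (a + b) + c);  /* a is absorbed into b: prints 0 */
    printf("a+(b+c) = %g\n", a + (b + c));  /* b and c cancel first: prints 1 */
    return 0;
}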

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
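A minimal sketch of the first approach (fixed reduction tree), assuming a serial model for brevity; this is not the talk's implementation. Because the split point is a function of n alone, any schedule that assigns whole subtrees to threads reproduces the same bits no matter how many threads run:

/* Fixed-reduction-tree summation: the tree depends only on n, never on
   thread count or data layout, so the rounding order, and hence every
   bit of the result, is fixed. */
double fixed_tree_sum(const double *x, long n) {
    if (n <= 0) return 0.0;
    if (n == 1) return x[0];
    long half = n / 2;                 /* split point determined by n only */
    return fixed_tree_sum(x, half) + fixed_tree_sum(x + half, n - half);
}

The cost is the one the slide notes: pinning the tree constrains scheduling and layout, which is why this approach can fall short on goals 2 (performance) and 3 (portability).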

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                                                              • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                                              • Why avoid communication
                                                                                                                                                                                              • Goals
                                                                                                                                                                                              • Outline
                                                                                                                                                                                              • Outline (2)
                                                                                                                                                                                              • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                                                              • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                                                              • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                                                              • Limits to parallel scaling (12)
                                                                                                                                                                                              • Limits to parallel scaling (22)
                                                                                                                                                                                              • Can we attain these lower bounds
                                                                                                                                                                                              • Outline (3)
                                                                                                                                                                                              • 25D Matrix Multiplication
                                                                                                                                                                                              • 25D Matrix Multiplication (2)
                                                                                                                                                                                              • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                                              • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                                              • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                                              • Handling Heterogeneity
                                                                                                                                                                                              • Application to Tensor Contractions
                                                                                                                                                                                              • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                                              • Application to Tensor Contractions (2)
                                                                                                                                                                                              • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                                              • vs
                                                                                                                                                                                              • Slide 26
                                                                                                                                                                                              • Strassen-like beyond matmul
                                                                                                                                                                                              • Cache and Network Oblivious Algorithms
                                                                                                                                                                                              • CARMA Performance Distributed Memory
                                                                                                                                                                                              • CARMA Performance Distributed Memory (2)
                                                                                                                                                                                              • CARMA Performance Shared Memory
                                                                                                                                                                                              • CARMA Performance Shared Memory (2)
                                                                                                                                                                                              • Why is CARMA Faster in Shared Memory
                                                                                                                                                                                              • Outline (4)
                                                                                                                                                                                              • One-sided Factorizations (LU QR) so far
                                                                                                                                                                                              • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                                              • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                                              • Minimizing Communication in TSLU
                                                                                                                                                                                              • Making TSLU Numerically Stable
                                                                                                                                                                                              • Stability of LU using TSLU CALU
                                                                                                                                                                                              • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                                              • Fixing TSLU
                                                                                                                                                                                              • 2D CALU with Tournament Pivoting
                                                                                                                                                                                              • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                                              • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                                              • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                                              • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                                              • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                                              • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                                              • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                                              • Outline (5)
                                                                                                                                                                                              • What about sparse matrices (13)
                                                                                                                                                                                              • Performance of 25D APSP using Kleene
                                                                                                                                                                                              • What about sparse matrices (23)
                                                                                                                                                                                              • What about sparse matrices (33)
                                                                                                                                                                                              • Outline (6)
                                                                                                                                                                                              • Symmetric Eigenproblem and SVD
                                                                                                                                                                                              • Slide 58
                                                                                                                                                                                              • Slide 59
                                                                                                                                                                                              • Slide 60
                                                                                                                                                                                              • Slide 61
                                                                                                                                                                                              • Slide 62
                                                                                                                                                                                              • Slide 63
                                                                                                                                                                                              • Slide 64
                                                                                                                                                                                              • Slide 65
                                                                                                                                                                                              • Slide 66
                                                                                                                                                                                              • Slide 67
                                                                                                                                                                                              • Slide 68
                                                                                                                                                                                              • Conventional vs CA - SBR
                                                                                                                                                                                              • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                              • Nonsymmetric Eigenproblem
                                                                                                                                                                                              • Attaining the Lower bounds Sequential
                                                                                                                                                                                              • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                              • Outline (7)
                                                                                                                                                                                              • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                              • Outline (8)
                                                                                                                                                                                              • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                              • Example The Difficulty of Tuning
                                                                                                                                                                                              • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                              • Register Profile Itanium 2
                                                                                                                                                                                              • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                              • Another example of tuning challenges for SpMV
                                                                                                                                                                                              • Zoom in to top corner
                                                                                                                                                                                              • 3x3 blocks look natural buthellip
                                                                                                                                                                                              • Extra Work Can Improve Efficiency
                                                                                                                                                                                              • Slide 86
                                                                                                                                                                                              • Slide 87
                                                                                                                                                                                              • Slide 88
                                                                                                                                                                                              • Slide 89
                                                                                                                                                                                              • Summary of Other Performance Optimizations
                                                                                                                                                                                              • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                              • Outline (9)
                                                                                                                                                                                              • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                              • Example CA-Conjugate Gradient
                                                                                                                                                                                              • Outline (10)
                                                                                                                                                                                              • Slide 96
                                                                                                                                                                                              • Slide 97
                                                                                                                                                                                              • Outline (11)
                                                                                                                                                                                              • What is a ldquosparse matrixrdquo
                                                                                                                                                                                              • Outline (12)
                                                                                                                                                                                              • Reproducible Floating Point Computation
                                                                                                                                                                                              • Intel MKL non-reproducibility
                                                                                                                                                                                              • GoalsApproaches for Reproducibility
                                                                                                                                                                                              • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                              • Collaborators and Supporters
                                                                                                                                                                                              • Summary

                                                                                                                                                                                                Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                                                ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                                                ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                                                bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                                                bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                                                What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

                                                                                                                                                                                                bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

                                                                                                                                                                                                bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

                                                                                                                                                                                                matrices

                                                                                                                                                                                                Explicit (O(nnz)) Implicit (o(nnz))

                                                                                                                                                                                                Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

                                                                                                                                                                                                Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

                                                                                                                                                                                                Indices

                                                                                                                                                                                                Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

                                                                                                                                                                                                ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

                                                                                                                                                                                                ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

                                                                                                                                                                                                bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

                                                                                                                                                                                                bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

                                                                                                                                                                                                101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% – 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

[Figures: "Absolute Error for Random Vectors" – same magnitude, opposite signs; "Relative Error for Orthogonal Vectors" – even the sign is not reproducible. A toy illustration of the underlying effect follows.]
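The cause is that floating-point addition is not associative: changing the thread count changes the reduction order, and each order rounds differently. A toy Python demonstration of the effect (not the MKL experiment itself):

import random

random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(10**6)]

s_forward = sum(data)                 # one summation order
s_reverse = sum(reversed(data))       # another order, same values

print(s_forward == s_reverse)         # typically False
print(abs(s_forward - s_reverse))     # small but nonzero difference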

                                                                                                                                                                                                103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (gives up goal 2 or 3)
  – Use (very) high precision to get exact answer (gives up goal 2) – see the sketch below
  – Prerounding technique (Nguyen, D.)
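Of these approaches, the "exact answer" route is the easiest to demonstrate: Python's math.fsum returns the correctly rounded sum, so the result is identical for every ordering of the summands. A sketch of that idea only, not of the talk's prerounding technique, which is a different, scalable algorithm:

import math
import random

random.seed(42)
data = [random.uniform(-1.0, 1.0) for _ in range(10**5)]
shuffled = list(data)
random.shuffle(shuffled)

print(sum(data) == sum(shuffled))              # often False: order-dependent
print(math.fsum(data) == math.fsum(shuffled))  # True: exact sum is order-independent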

                                                                                                                                                                                                104

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

                                                                                                                                                                                                Summary

Don't Communic…

                                                                                                                                                                                                106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


                                                                                                                                                                                                  • Attaining the Lower bounds Sequential
                                                                                                                                                                                                  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                                  • Outline (7)
                                                                                                                                                                                                  • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                                  • Outline (8)
                                                                                                                                                                                                  • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                                  • Example The Difficulty of Tuning
                                                                                                                                                                                                  • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                                  • Register Profile Itanium 2
                                                                                                                                                                                                  • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                                  • Another example of tuning challenges for SpMV
                                                                                                                                                                                                  • Zoom in to top corner
                                                                                                                                                                                                  • 3x3 blocks look natural buthellip
                                                                                                                                                                                                  • Extra Work Can Improve Efficiency
                                                                                                                                                                                                  • Slide 86
                                                                                                                                                                                                  • Slide 87
                                                                                                                                                                                                  • Slide 88
                                                                                                                                                                                                  • Slide 89
                                                                                                                                                                                                  • Summary of Other Performance Optimizations
                                                                                                                                                                                                  • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                                  • Outline (9)
                                                                                                                                                                                                  • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                                  • Example CA-Conjugate Gradient
                                                                                                                                                                                                  • Outline (10)
                                                                                                                                                                                                  • Slide 96
                                                                                                                                                                                                  • Slide 97
                                                                                                                                                                                                  • Outline (11)
                                                                                                                                                                                                  • What is a ldquosparse matrixrdquo
                                                                                                                                                                                                  • Outline (12)
                                                                                                                                                                                                  • Reproducible Floating Point Computation
                                                                                                                                                                                                  • Intel MKL non-reproducibility
                                                                                                                                                                                                  • GoalsApproaches for Reproducibility
                                                                                                                                                                                                  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                                  • Collaborators and Supporters
                                                                                                                                                                                                  • Summary

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS mbH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
[Plot: Absolute Error for Random Vectors – same magnitude, opposite signs]
[Plot: Relative Error for Orthogonal Vectors – the sign is not reproducible]

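Nothing exotic is happening in these measurements: floating-point addition is not associative, so changing the thread count changes the reduction tree, which changes the rounded result. The sketch below is a hypothetical stand-in for the experiment (plain Python, not MKL): each "thread" sums one contiguous chunk, and the partial sums are then combined, so each p induces a different summation order.

import random

# Nonassociativity demo: the same summands, reduced with different
# chunkings ("threads"), give different floating-point results.
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(10**6)]

def chunked_sum(v, p):
    # Each of p "threads" sums one contiguous chunk left to right;
    # the p partial sums are then combined. Different p => different
    # reduction tree => (in general) a different rounded result.
    bounds = [len(v) * i // p for i in range(p + 1)]
    return sum(sum(v[bounds[i]:bounds[i + 1]]) for i in range(p))

results = [chunked_sum(x, p) for p in (1, 2, 3, 4)]
for p, s in zip((1, 2, 3, 4), results):
    print(f"{p} thread(s): {s:.17g}")
# The slide's metric: absolute error = maximum - minimum
print("absolute error:", max(results) - min(results))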

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.), sketched below
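To make the prerounding idea concrete, here is a minimal one-bin sketch in Python. It is an illustration under stated assumptions (IEEE-754 doubles, no over/underflow), not the ReproBLAS algorithm, and prerounded_sum is a hypothetical helper: pick a single quantum q, a power of two, from max|x_i| and n; round every summand to a multiple of q; then every addition is exact, so the result cannot depend on the summation order. The actual technique of Nguyen and Demmel uses several such bins to recover the accuracy this one-bin version gives up.

import math
import random

def prerounded_sum(x):
    # One-bin prerounding sketch (hypothetical helper, illustrative only).
    n = len(x)
    if n == 0:
        return 0.0
    m = max(abs(v) for v in x)   # max/abs are exact and order-independent
    if m == 0.0:
        return 0.0
    # Quantum q = 2^k, chosen so every partial sum of rounded summands
    # is a multiple of q with magnitude below 2^53 * q, hence exactly
    # representable: all additions are then exact.
    k = math.frexp(m)[1] - 53 + (n - 1).bit_length() + 1
    q = math.ldexp(1.0, k)
    # v/q is exact (q is a power of two), floor is exact, and the
    # product with q is exact, so the rounding itself is reproducible.
    return sum(math.floor(v / q) * q for v in x)

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(10**6)]
xs = list(x)
random.shuffle(xs)
print(prerounded_sum(x) == prerounded_sum(xs))   # True: bit-wise identical

The price of the single bin is accuracy: bits below q are discarded, which is why the production algorithm keeps several bins and why the user-chosen-accuracy goal above remains achievable.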


Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, MathWorks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

                                                                                                                                                                                                    • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                                                    • Why avoid communication
                                                                                                                                                                                                    • Goals
                                                                                                                                                                                                    • Outline
                                                                                                                                                                                                    • Outline (2)
                                                                                                                                                                                                    • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                                                                    • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                                                                    • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                                                                    • Limits to parallel scaling (12)
                                                                                                                                                                                                    • Limits to parallel scaling (22)
                                                                                                                                                                                                    • Can we attain these lower bounds
                                                                                                                                                                                                    • Outline (3)
                                                                                                                                                                                                    • 25D Matrix Multiplication
                                                                                                                                                                                                    • 25D Matrix Multiplication (2)
                                                                                                                                                                                                    • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                                                    • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                                                    • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                                                    • Handling Heterogeneity
                                                                                                                                                                                                    • Application to Tensor Contractions
                                                                                                                                                                                                    • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                                                    • Application to Tensor Contractions (2)
                                                                                                                                                                                                    • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                                                    • vs
                                                                                                                                                                                                    • Slide 26
                                                                                                                                                                                                    • Strassen-like beyond matmul
                                                                                                                                                                                                    • Cache and Network Oblivious Algorithms
                                                                                                                                                                                                    • CARMA Performance Distributed Memory
                                                                                                                                                                                                    • CARMA Performance Distributed Memory (2)
                                                                                                                                                                                                    • CARMA Performance Shared Memory
                                                                                                                                                                                                    • CARMA Performance Shared Memory (2)
                                                                                                                                                                                                    • Why is CARMA Faster in Shared Memory
                                                                                                                                                                                                    • Outline (4)
                                                                                                                                                                                                    • One-sided Factorizations (LU QR) so far
                                                                                                                                                                                                    • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                                                    • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                                                    • Minimizing Communication in TSLU
                                                                                                                                                                                                    • Making TSLU Numerically Stable
                                                                                                                                                                                                    • Stability of LU using TSLU CALU
                                                                                                                                                                                                    • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                                                    • Fixing TSLU
                                                                                                                                                                                                    • 2D CALU with Tournament Pivoting
                                                                                                                                                                                                    • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                                                    • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                                                    • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                                                    • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                                                    • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                                                    • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                                                    • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                                                    • Outline (5)
                                                                                                                                                                                                    • What about sparse matrices (13)
                                                                                                                                                                                                    • Performance of 25D APSP using Kleene
                                                                                                                                                                                                    • What about sparse matrices (23)
                                                                                                                                                                                                    • What about sparse matrices (33)
                                                                                                                                                                                                    • Outline (6)
                                                                                                                                                                                                    • Symmetric Eigenproblem and SVD
                                                                                                                                                                                                    • Slide 58
                                                                                                                                                                                                    • Slide 59
                                                                                                                                                                                                    • Slide 60
                                                                                                                                                                                                    • Slide 61
                                                                                                                                                                                                    • Slide 62
                                                                                                                                                                                                    • Slide 63
                                                                                                                                                                                                    • Slide 64
                                                                                                                                                                                                    • Slide 65
                                                                                                                                                                                                    • Slide 66
                                                                                                                                                                                                    • Slide 67
                                                                                                                                                                                                    • Slide 68
                                                                                                                                                                                                    • Conventional vs CA - SBR
                                                                                                                                                                                                    • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                                    • Nonsymmetric Eigenproblem
                                                                                                                                                                                                    • Attaining the Lower bounds Sequential
                                                                                                                                                                                                    • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                                    • Outline (7)
                                                                                                                                                                                                    • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                                    • Outline (8)
                                                                                                                                                                                                    • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                                    • Example The Difficulty of Tuning
                                                                                                                                                                                                    • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                                    • Register Profile Itanium 2
                                                                                                                                                                                                    • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                                    • Another example of tuning challenges for SpMV
                                                                                                                                                                                                    • Zoom in to top corner
                                                                                                                                                                                                    • 3x3 blocks look natural buthellip
                                                                                                                                                                                                    • Extra Work Can Improve Efficiency
                                                                                                                                                                                                    • Slide 86
                                                                                                                                                                                                    • Slide 87
                                                                                                                                                                                                    • Slide 88
                                                                                                                                                                                                    • Slide 89
                                                                                                                                                                                                    • Summary of Other Performance Optimizations
                                                                                                                                                                                                    • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                                    • Outline (9)
                                                                                                                                                                                                    • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                                    • Example CA-Conjugate Gradient
                                                                                                                                                                                                    • Outline (10)
                                                                                                                                                                                                    • Slide 96
                                                                                                                                                                                                    • Slide 97
                                                                                                                                                                                                    • Outline (11)
                                                                                                                                                                                                    • What is a ldquosparse matrixrdquo
                                                                                                                                                                                                    • Outline (12)
                                                                                                                                                                                                    • Reproducible Floating Point Computation
                                                                                                                                                                                                    • Intel MKL non-reproducibility
                                                                                                                                                                                                    • GoalsApproaches for Reproducibility
                                                                                                                                                                                                    • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                                    • Collaborators and Supporters
                                                                                                                                                                                                    • Summary

                                                                                                                                                                                                      101

                                                                                                                                                                                                      bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

                                                                                                                                                                                                      ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

                                                                                                                                                                                                      demanded by customers (construction engineers) otherwise they donrsquot believe results

                                                                                                                                                                                                      ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

                                                                                                                                                                                                      ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

                                                                                                                                                                                                      Reproducible Floating Point Computation

                                                                                                                                                                                                      Absolute Error for Random Vectors

                                                                                                                                                                                                      Same magnitude opposite signs

                                                                                                                                                                                                      Intel MKL non-reproducibility

                                                                                                                                                                                                      Relative Error for Orthogonal vectors

                                                                                                                                                                                                      Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

                                                                                                                                                                                                      Sign notreproducible

                                                                                                                                                                                                      103

                                                                                                                                                                                                      bull Consider summation or dot productbull Goals

                                                                                                                                                                                                      1 Same answer independent of layout processors order of summands

                                                                                                                                                                                                      2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

                                                                                                                                                                                                      bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

                                                                                                                                                                                                      GoalsApproaches for Reproducibility

                                                                                                                                                                                                      104

                                                                                                                                                                                                      Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

                                                                                                                                                                                                      Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

                                                                                                                                                                                                      Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

                                                                                                                                                                                                      bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

                                                                                                                                                                                                      Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

                                                                                                                                                                                                      bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

                                                                                                                                                                                                      Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

                                                                                                                                                                                                      Instruments NEC Nokia NVIDIA Samsung Oracle

                                                                                                                                                                                                      bull bebopcsberkeleyedu

                                                                                                                                                                                                      Summary

                                                                                                                                                                                                      Donrsquot Communichellip

                                                                                                                                                                                                      106

                                                                                                                                                                                                      Time to redesign all linear algebra n-body hellip algorithms and software

                                                                                                                                                                                                      (and compilers)

• Implementing Communication-Avoiding Algorithms
• Why avoid communication?
• Goals
• Outline
• Outline (2)
• Lower bound for all "n3-like" linear algebra
• Lower bound for all "n3-like" linear algebra (2)
• Lower bound for all "n3-like" linear algebra (3)
• Limits to parallel scaling (1/2)
• Limits to parallel scaling (2/2)
• Can we attain these lower bounds?
• Outline (3)
• 2.5D Matrix Multiplication
• 2.5D Matrix Multiplication (2)
• 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
• Perfect Strong Scaling – in Time and Energy (1/2)
• Perfect Strong Scaling – in Time and Energy (2/2)
• Handling Heterogeneity
• Application to Tensor Contractions
• C(i,j,k) = Σm A(i,j,m)·B(m,k)
• Application to Tensor Contractions (2)
• Communication Lower Bounds for Strassen-like matmul algorithms
• vs.
• Slide 26
• Strassen-like beyond matmul
• Cache and Network Oblivious Algorithms
• CARMA Performance: Distributed Memory
• CARMA Performance: Distributed Memory (2)
• CARMA Performance: Shared Memory
• CARMA Performance: Shared Memory (2)
• Why is CARMA Faster in Shared Memory?
• Outline (4)
• One-sided Factorizations (LU, QR), so far
• TSQR: An Architecture-Dependent Algorithm
• Back to LU: Using similar idea for TSLU as TSQR: Use reduction…
• Minimizing Communication in TSLU
• Making TSLU Numerically Stable
• Stability of LU using TSLU: CALU
• Why is stability of TSLU just a "Thm"?
• Fixing TSLU
• 2D CALU with Tournament Pivoting
• 2.5D CALU with Tournament Pivoting (c=4 copies)
• Exascale Machine Parameters (Source: DOE Exascale Workshop)
• Exascale predicted speedups for Gaussian Elimination: 2D CA…
• 2.5D vs 2D LU With and Without Pivoting
• Other CA algorithms for Ax=b, least squares (1/3)
• Other CA algorithms for Ax=b, least squares (2/3)
• Other CA algorithms for Ax=b, least squares (3/3)
• Outline (5)
• What about sparse matrices? (1/3)
• Performance of 2.5D APSP using Kleene
• What about sparse matrices? (2/3)
• What about sparse matrices? (3/3)
• Outline (6)
• Symmetric Eigenproblem and SVD
• Slide 58
• Slide 59
• Slide 60
• Slide 61
• Slide 62
• Slide 63
• Slide 64
• Slide 65
• Slide 66
• Slide 67
• Slide 68
• Conventional vs CA - SBR
• Speedups of Sym. Band Reduction vs DSBTRD
• Nonsymmetric Eigenproblem
• Attaining the Lower bounds: Sequential
• Attaining the Lower bounds: Parallel 2D, M=n²/P (Ignoring po…
• Outline (7)
• Avoiding Communication in Iterative Linear Algebra
• Outline (8)
• Example: The Difficulty of Tuning SpMV
• Example: The Difficulty of Tuning
• Speedups on Itanium 2: The Need for Search
• Register Profile: Itanium 2
• Register Profiles: IBM and Intel IA-64
• Another example of tuning challenges for SpMV
• Zoom in to top corner
• 3x3 blocks look natural, but…
• Extra Work Can Improve Efficiency
• Slide 86
• Slide 87
• Slide 88
• Slide 89
• Summary of Other Performance Optimizations
• Optimized Sparse Kernel Interface - OSKI
• Outline (9)
• Example: Classical Conjugate Gradient (CG)
• Example: CA-Conjugate Gradient
• Outline (10)
• Slide 96
• Slide 97
• Outline (11)
• What is a "sparse matrix"?
• Outline (12)
• Reproducible Floating Point Computation
• Intel MKL non-reproducibility
• Goals/Approaches for Reproducibility
• Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdow…
• Collaborators and Supporters
• Summary

Intel MKL non-reproducibility

103

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

[Figure: two panels, "Absolute Error for Random Vectors" and "Relative Error for Orthogonal Vectors". For orthogonal vectors the results have the same magnitude but opposite signs – even the sign is not reproducible.]
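The mechanism behind this experiment is easy to demonstrate without MKL: floating-point addition is not associative, so a reduction split across a different number of threads rounds differently. Below is a minimal Python sketch (purely illustrative, not a call into MKL) that simulates the thread partitioning by chunking one dot product four ways and measuring the slide's two error quantities:

```python
import random

# Simulate the slide's experiment: one dot product, reduced as if by
# 1, 2, 3, or 4 threads. Addition is not associative, so each chunking
# rounds differently and the four "identical" results disagree.
random.seed(42)
n = 10**6
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = [random.uniform(-1.0, 1.0) for _ in range(n)]
prods = [a * b for a, b in zip(x, y)]

def chunked_sum(v, nchunks):
    """Sum v as nchunks partial sums combined at the end,
    mimicking a parallel reduction across nchunks threads."""
    size = (len(v) + nchunks - 1) // nchunks
    partials = [sum(v[i:i + size]) for i in range(0, len(v), size)]
    return sum(partials)

results = [chunked_sum(prods, t) for t in (1, 2, 3, 4)]
abs_err = max(results) - min(results)              # "absolute error" above
rel_err = abs_err / max(abs(r) for r in results)   # "relative error" above
print(results)        # four slightly different answers
print(abs_err, rel_err)
```

For inputs whose true dot product is near zero (the orthogonal-vectors case), the spread is as large as the results themselves, so the relative error is of order 1 – which is exactly why the sign is not reproducible.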

Goals/Approaches for Reproducibility

104

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
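A minimal sketch of the prerounding idea follows – a simplified illustration of the Nguyen/Demmel technique, not the actual ReproBLAS code. Every summand is rounded onto a fixed grid determined only by the global maximum magnitude and the summand count, so every addition on that grid is exact and therefore order-independent; repeating on the remainders recovers accuracy:

```python
import math, random

def reproducible_sum(x, levels=3):
    # Prerounding sketch (simplified; assumes IEEE 754 doubles, no overflow).
    # Each pass rounds the remaining values onto a grid coarse enough that
    # every partial sum is exact -- hence identical for ANY summation order.
    n = len(x)
    if n == 0:
        return 0.0
    r = [float(v) for v in x]
    m = max(abs(v) for v in r)        # one global reduction, order-independent
    if m == 0.0 or math.isinf(m) or math.isnan(m):
        return sum(r)                 # edge cases: fall back
    s = 0.0
    for _ in range(levels):
        # Boundary T >= 4*n*m: prerounded values are multiples of ulp(T)
        # and all partial sums stay below T, so every addition is exact.
        T = 2.0 ** (math.ceil(math.log2(m)) + math.ceil(math.log2(n)) + 2)
        q = [(v + T) - T for v in r]  # round each v onto the grid ulp(T)
        s += sum(q)                   # exact, order-independent level sum
        r = [v - qv for v, qv in zip(r, q)]  # exact remainders
        m = T * 2.0 ** -53            # remainders are below ulp(T)/2
    return s

# Same bits no matter how the data is permuted (or partitioned):
random.seed(1)
data = [random.uniform(-1.0, 1.0) for _ in range(10**5)]
a = reproducible_sum(data)
random.shuffle(data)
assert reproducible_sum(data) == a    # bit-identical
```

The design point: only the first goal's dependence on the *data* survives (the max-magnitude reduction), and that reduction is itself order-independent, so the answer depends only on the multiset of summands – satisfying goal 1 while keeping the per-level work to one cheap pass, unlike the fixed-tree and very-high-precision approaches.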

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

                                                                                                                                                                                                        Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

                                                                                                                                                                                                        Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

                                                                                                                                                                                                        bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

                                                                                                                                                                                                        Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

                                                                                                                                                                                                        bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

                                                                                                                                                                                                        Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

                                                                                                                                                                                                        Instruments NEC Nokia NVIDIA Samsung Oracle

                                                                                                                                                                                                        bull bebopcsberkeleyedu

                                                                                                                                                                                                        Summary

                                                                                                                                                                                                        Donrsquot Communichellip

                                                                                                                                                                                                        106

                                                                                                                                                                                                        Time to redesign all linear algebra n-body hellip algorithms and software

                                                                                                                                                                                                        (and compilers)

                                                                                                                                                                                                        • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                                                        • Why avoid communication
                                                                                                                                                                                                        • Goals
                                                                                                                                                                                                        • Outline
                                                                                                                                                                                                        • Outline (2)
                                                                                                                                                                                                        • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                                                                        • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                                                                        • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                                                                        • Limits to parallel scaling (12)
                                                                                                                                                                                                        • Limits to parallel scaling (22)
                                                                                                                                                                                                        • Can we attain these lower bounds
                                                                                                                                                                                                        • Outline (3)
                                                                                                                                                                                                        • 25D Matrix Multiplication
                                                                                                                                                                                                        • 25D Matrix Multiplication (2)
                                                                                                                                                                                                        • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                                                        • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                                                        • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                                                        • Handling Heterogeneity
                                                                                                                                                                                                        • Application to Tensor Contractions
                                                                                                                                                                                                        • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                                                        • Application to Tensor Contractions (2)
                                                                                                                                                                                                        • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                                                        • vs
                                                                                                                                                                                                        • Slide 26
                                                                                                                                                                                                        • Strassen-like beyond matmul
                                                                                                                                                                                                        • Cache and Network Oblivious Algorithms
                                                                                                                                                                                                        • CARMA Performance Distributed Memory
                                                                                                                                                                                                        • CARMA Performance Distributed Memory (2)
                                                                                                                                                                                                        • CARMA Performance Shared Memory
                                                                                                                                                                                                        • CARMA Performance Shared Memory (2)
                                                                                                                                                                                                        • Why is CARMA Faster in Shared Memory
                                                                                                                                                                                                        • Outline (4)
                                                                                                                                                                                                        • One-sided Factorizations (LU QR) so far
                                                                                                                                                                                                        • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                                                        • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                                                        • Minimizing Communication in TSLU
                                                                                                                                                                                                        • Making TSLU Numerically Stable
                                                                                                                                                                                                        • Stability of LU using TSLU CALU
                                                                                                                                                                                                        • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                                                        • Fixing TSLU
                                                                                                                                                                                                        • 2D CALU with Tournament Pivoting
                                                                                                                                                                                                        • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                                                        • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                                                        • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                                                        • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                                                        • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                                                        • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                                                        • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                                                        • Outline (5)
                                                                                                                                                                                                        • What about sparse matrices (13)
                                                                                                                                                                                                        • Performance of 25D APSP using Kleene
                                                                                                                                                                                                        • What about sparse matrices (23)
                                                                                                                                                                                                        • What about sparse matrices (33)
                                                                                                                                                                                                        • Outline (6)
                                                                                                                                                                                                        • Symmetric Eigenproblem and SVD
                                                                                                                                                                                                        • Slide 58
                                                                                                                                                                                                        • Slide 59
                                                                                                                                                                                                        • Slide 60
                                                                                                                                                                                                        • Slide 61
                                                                                                                                                                                                        • Slide 62
                                                                                                                                                                                                        • Slide 63
                                                                                                                                                                                                        • Slide 64
                                                                                                                                                                                                        • Slide 65
                                                                                                                                                                                                        • Slide 66
                                                                                                                                                                                                        • Slide 67
                                                                                                                                                                                                        • Slide 68
                                                                                                                                                                                                        • Conventional vs CA - SBR
                                                                                                                                                                                                        • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                                        • Nonsymmetric Eigenproblem
                                                                                                                                                                                                        • Attaining the Lower bounds Sequential
                                                                                                                                                                                                        • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                                        • Outline (7)
                                                                                                                                                                                                        • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                                        • Outline (8)
                                                                                                                                                                                                        • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                                        • Example The Difficulty of Tuning
                                                                                                                                                                                                        • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                                        • Register Profile Itanium 2
                                                                                                                                                                                                        • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                                        • Another example of tuning challenges for SpMV
                                                                                                                                                                                                        • Zoom in to top corner
                                                                                                                                                                                                        • 3x3 blocks look natural buthellip
                                                                                                                                                                                                        • Extra Work Can Improve Efficiency
                                                                                                                                                                                                        • Slide 86
                                                                                                                                                                                                        • Slide 87
                                                                                                                                                                                                        • Slide 88
                                                                                                                                                                                                        • Slide 89
                                                                                                                                                                                                        • Summary of Other Performance Optimizations
                                                                                                                                                                                                        • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                                        • Outline (9)
                                                                                                                                                                                                        • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                                        • Example CA-Conjugate Gradient
                                                                                                                                                                                                        • Outline (10)
                                                                                                                                                                                                        • Slide 96
                                                                                                                                                                                                        • Slide 97
                                                                                                                                                                                                        • Outline (11)
                                                                                                                                                                                                        • What is a ldquosparse matrixrdquo
                                                                                                                                                                                                        • Outline (12)
                                                                                                                                                                                                        • Reproducible Floating Point Computation
                                                                                                                                                                                                        • Intel MKL non-reproducibility
                                                                                                                                                                                                        • GoalsApproaches for Reproducibility
                                                                                                                                                                                                        • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                                        • Collaborators and Supporters
                                                                                                                                                                                                        • Summary

                                                                                                                                                                                                          103

                                                                                                                                                                                                          bull Consider summation or dot productbull Goals

                                                                                                                                                                                                          1 Same answer independent of layout processors order of summands

                                                                                                                                                                                                          2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

                                                                                                                                                                                                          bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

                                                                                                                                                                                                          GoalsApproaches for Reproducibility

                                                                                                                                                                                                          104

                                                                                                                                                                                                          Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

                                                                                                                                                                                                          Summary

Don't Communic…

                                                                                                                                                                                                          106

Time to redesign all linear algebra, n-body, … algorithms and software

                                                                                                                                                                                                          (and compilers)

                                                                                                                                                                                                          • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                                                          • Why avoid communication
                                                                                                                                                                                                          • Goals
                                                                                                                                                                                                          • Outline
                                                                                                                                                                                                          • Outline (2)
                                                                                                                                                                                                          • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                                                                          • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                                                                          • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                                                                          • Limits to parallel scaling (12)
                                                                                                                                                                                                          • Limits to parallel scaling (22)
                                                                                                                                                                                                          • Can we attain these lower bounds
                                                                                                                                                                                                          • Outline (3)
                                                                                                                                                                                                          • 25D Matrix Multiplication
                                                                                                                                                                                                          • 25D Matrix Multiplication (2)
                                                                                                                                                                                                          • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                                                          • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                                                          • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                                                          • Handling Heterogeneity
                                                                                                                                                                                                          • Application to Tensor Contractions
                                                                                                                                                                                                          • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                                                          • Application to Tensor Contractions (2)
                                                                                                                                                                                                          • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                                                          • vs
                                                                                                                                                                                                          • Slide 26
                                                                                                                                                                                                          • Strassen-like beyond matmul
                                                                                                                                                                                                          • Cache and Network Oblivious Algorithms
                                                                                                                                                                                                          • CARMA Performance Distributed Memory
                                                                                                                                                                                                          • CARMA Performance Distributed Memory (2)
                                                                                                                                                                                                          • CARMA Performance Shared Memory
                                                                                                                                                                                                          • CARMA Performance Shared Memory (2)
                                                                                                                                                                                                          • Why is CARMA Faster in Shared Memory
                                                                                                                                                                                                          • Outline (4)
                                                                                                                                                                                                          • One-sided Factorizations (LU QR) so far
                                                                                                                                                                                                          • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                                                          • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                                                          • Minimizing Communication in TSLU
                                                                                                                                                                                                          • Making TSLU Numerically Stable
                                                                                                                                                                                                          • Stability of LU using TSLU CALU
                                                                                                                                                                                                          • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                                                          • Fixing TSLU
                                                                                                                                                                                                          • 2D CALU with Tournament Pivoting
                                                                                                                                                                                                          • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                                                          • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                                                          • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                                                          • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                                                          • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                                                          • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                                                          • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                                                          • Outline (5)
                                                                                                                                                                                                          • What about sparse matrices (13)
                                                                                                                                                                                                          • Performance of 25D APSP using Kleene
                                                                                                                                                                                                          • What about sparse matrices (23)
                                                                                                                                                                                                          • What about sparse matrices (33)
                                                                                                                                                                                                          • Outline (6)
                                                                                                                                                                                                          • Symmetric Eigenproblem and SVD
                                                                                                                                                                                                          • Slide 58
                                                                                                                                                                                                          • Slide 59
                                                                                                                                                                                                          • Slide 60
                                                                                                                                                                                                          • Slide 61
                                                                                                                                                                                                          • Slide 62
                                                                                                                                                                                                          • Slide 63
                                                                                                                                                                                                          • Slide 64
                                                                                                                                                                                                          • Slide 65
                                                                                                                                                                                                          • Slide 66
                                                                                                                                                                                                          • Slide 67
                                                                                                                                                                                                          • Slide 68
                                                                                                                                                                                                          • Conventional vs CA - SBR
                                                                                                                                                                                                          • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                                          • Nonsymmetric Eigenproblem
                                                                                                                                                                                                          • Attaining the Lower bounds Sequential
                                                                                                                                                                                                          • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                                          • Outline (7)
                                                                                                                                                                                                          • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                                          • Outline (8)
                                                                                                                                                                                                          • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                                          • Example The Difficulty of Tuning
                                                                                                                                                                                                          • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                                          • Register Profile Itanium 2
                                                                                                                                                                                                          • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                                          • Another example of tuning challenges for SpMV
                                                                                                                                                                                                          • Zoom in to top corner
                                                                                                                                                                                                          • 3x3 blocks look natural buthellip
                                                                                                                                                                                                          • Extra Work Can Improve Efficiency
                                                                                                                                                                                                          • Slide 86
                                                                                                                                                                                                          • Slide 87
                                                                                                                                                                                                          • Slide 88
                                                                                                                                                                                                          • Slide 89
                                                                                                                                                                                                          • Summary of Other Performance Optimizations
                                                                                                                                                                                                          • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                                          • Outline (9)
                                                                                                                                                                                                          • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                                          • Example CA-Conjugate Gradient
                                                                                                                                                                                                          • Outline (10)
                                                                                                                                                                                                          • Slide 96
                                                                                                                                                                                                          • Slide 97
                                                                                                                                                                                                          • Outline (11)
                                                                                                                                                                                                          • What is a ldquosparse matrixrdquo
                                                                                                                                                                                                          • Outline (12)
                                                                                                                                                                                                          • Reproducible Floating Point Computation
                                                                                                                                                                                                          • Intel MKL non-reproducibility
                                                                                                                                                                                                          • GoalsApproaches for Reproducibility
                                                                                                                                                                                                          • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                                          • Collaborators and Supporters
                                                                                                                                                                                                          • Summary

                                                                                                                                                                                                            104

                                                                                                                                                                                                            Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

                                                                                                                                                                                                            Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

                                                                                                                                                                                                            Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

                                                                                                                                                                                                            bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

                                                                                                                                                                                                            Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

                                                                                                                                                                                                            bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

                                                                                                                                                                                                            Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

                                                                                                                                                                                                            Instruments NEC Nokia NVIDIA Samsung Oracle

                                                                                                                                                                                                            bull bebopcsberkeleyedu

                                                                                                                                                                                                            Summary

                                                                                                                                                                                                            Donrsquot Communichellip

                                                                                                                                                                                                            106

                                                                                                                                                                                                            Time to redesign all linear algebra n-body hellip algorithms and software

                                                                                                                                                                                                            (and compilers)

                                                                                                                                                                                                            • Implementing Communication-Avoiding Algorithms
                                                                                                                                                                                                            • Why avoid communication
                                                                                                                                                                                                            • Goals
                                                                                                                                                                                                            • Outline
                                                                                                                                                                                                            • Outline (2)
                                                                                                                                                                                                            • Lower bound for all ldquon3-likerdquo linear algebra
                                                                                                                                                                                                            • Lower bound for all ldquon3-likerdquo linear algebra (2)
                                                                                                                                                                                                            • Lower bound for all ldquon3-likerdquo linear algebra (3)
                                                                                                                                                                                                            • Limits to parallel scaling (12)
                                                                                                                                                                                                            • Limits to parallel scaling (22)
                                                                                                                                                                                                            • Can we attain these lower bounds
                                                                                                                                                                                                            • Outline (3)
                                                                                                                                                                                                            • 25D Matrix Multiplication
                                                                                                                                                                                                            • 25D Matrix Multiplication (2)
                                                                                                                                                                                                            • 25D Matmul on BGP 16K nodes 64K cores (2)
                                                                                                                                                                                                            • Perfect Strong Scaling ndash in Time and Energy (12)
                                                                                                                                                                                                            • Perfect Strong Scaling ndash in Time and Energy (22)
                                                                                                                                                                                                            • Handling Heterogeneity
                                                                                                                                                                                                            • Application to Tensor Contractions
                                                                                                                                                                                                            • C(ijk) = Σm A(ijm)B(mk)
                                                                                                                                                                                                            • Application to Tensor Contractions (2)
                                                                                                                                                                                                            • Communication Lower Bounds for Strassen-like matmul algorithms
                                                                                                                                                                                                            • vs
                                                                                                                                                                                                            • Slide 26
                                                                                                                                                                                                            • Strassen-like beyond matmul
                                                                                                                                                                                                            • Cache and Network Oblivious Algorithms
                                                                                                                                                                                                            • CARMA Performance Distributed Memory
                                                                                                                                                                                                            • CARMA Performance Distributed Memory (2)
                                                                                                                                                                                                            • CARMA Performance Shared Memory
                                                                                                                                                                                                            • CARMA Performance Shared Memory (2)
                                                                                                                                                                                                            • Why is CARMA Faster in Shared Memory
                                                                                                                                                                                                            • Outline (4)
                                                                                                                                                                                                            • One-sided Factorizations (LU QR) so far
                                                                                                                                                                                                            • TSQR An Architecture-Dependent Algorithm
                                                                                                                                                                                                            • Back to LU Using similar idea for TSLU as TSQR Use reduction
                                                                                                                                                                                                            • Minimizing Communication in TSLU
                                                                                                                                                                                                            • Making TSLU Numerically Stable
                                                                                                                                                                                                            • Stability of LU using TSLU CALU
                                                                                                                                                                                                            • Why is stability of TSLU just a ldquoThmrdquo
                                                                                                                                                                                                            • Fixing TSLU
                                                                                                                                                                                                            • 2D CALU with Tournament Pivoting
                                                                                                                                                                                                            • 25D CALU with Tournament Pivoting (c=4 copies)
                                                                                                                                                                                                            • Exascale Machine Parameters Source DOE Exascale Workshop
                                                                                                                                                                                                            • Exascale predicted speedups for Gaussian Elimination 2D CA
                                                                                                                                                                                                            • 25D vs 2D LU With and Without Pivoting
                                                                                                                                                                                                            • Other CA algorithms for Ax=b least squares(13)
                                                                                                                                                                                                            • Other CA algorithms for Ax=b least squares (23)
                                                                                                                                                                                                            • Other CA algorithms for Ax=b least squares (33)
                                                                                                                                                                                                            • Outline (5)
                                                                                                                                                                                                            • What about sparse matrices (13)
                                                                                                                                                                                                            • Performance of 25D APSP using Kleene
                                                                                                                                                                                                            • What about sparse matrices (23)
                                                                                                                                                                                                            • What about sparse matrices (33)
                                                                                                                                                                                                            • Outline (6)
                                                                                                                                                                                                            • Symmetric Eigenproblem and SVD
                                                                                                                                                                                                            • Slide 58
                                                                                                                                                                                                            • Slide 59
                                                                                                                                                                                                            • Slide 60
                                                                                                                                                                                                            • Slide 61
                                                                                                                                                                                                            • Slide 62
                                                                                                                                                                                                            • Slide 63
                                                                                                                                                                                                            • Slide 64
                                                                                                                                                                                                            • Slide 65
                                                                                                                                                                                                            • Slide 66
                                                                                                                                                                                                            • Slide 67
                                                                                                                                                                                                            • Slide 68
                                                                                                                                                                                                            • Conventional vs CA - SBR
                                                                                                                                                                                                            • Speedups of Sym Band Reduction vs DSBTRD
                                                                                                                                                                                                            • Nonsymmetric Eigenproblem
                                                                                                                                                                                                            • Attaining the Lower bounds Sequential
                                                                                                                                                                                                            • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
                                                                                                                                                                                                            • Outline (7)
                                                                                                                                                                                                            • Avoiding Communication in Iterative Linear Algebra
                                                                                                                                                                                                            • Outline (8)
                                                                                                                                                                                                            • Example The Difficulty of Tuning SpMV
                                                                                                                                                                                                            • Example The Difficulty of Tuning
                                                                                                                                                                                                            • Speedups on Itanium 2 The Need for Search
                                                                                                                                                                                                            • Register Profile Itanium 2
                                                                                                                                                                                                            • Register Profiles IBM and Intel IA-64
                                                                                                                                                                                                            • Another example of tuning challenges for SpMV
                                                                                                                                                                                                            • Zoom in to top corner
                                                                                                                                                                                                            • 3x3 blocks look natural buthellip
                                                                                                                                                                                                            • Extra Work Can Improve Efficiency
                                                                                                                                                                                                            • Slide 86
                                                                                                                                                                                                            • Slide 87
                                                                                                                                                                                                            • Slide 88
                                                                                                                                                                                                            • Slide 89
                                                                                                                                                                                                            • Summary of Other Performance Optimizations
                                                                                                                                                                                                            • Optimized Sparse Kernel Interface - OSKI
                                                                                                                                                                                                            • Outline (9)
                                                                                                                                                                                                            • Example Classical Conjugate Gradient (CG)
                                                                                                                                                                                                            • Example CA-Conjugate Gradient
                                                                                                                                                                                                            • Outline (10)
                                                                                                                                                                                                            • Slide 96
                                                                                                                                                                                                            • Slide 97
                                                                                                                                                                                                            • Outline (11)
                                                                                                                                                                                                            • What is a ldquosparse matrixrdquo
                                                                                                                                                                                                            • Outline (12)
                                                                                                                                                                                                            • Reproducible Floating Point Computation
                                                                                                                                                                                                            • Intel MKL non-reproducibility
                                                                                                                                                                                                            • GoalsApproaches for Reproducibility
                                                                                                                                                                                                            • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
                                                                                                                                                                                                            • Collaborators and Supporters
                                                                                                                                                                                                            • Summary

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).
